Releases: UKPLab/sentence-transformers
v5.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; more hard negatives mining formats
This release introduces 2 new efficient computing backends for SparseEncoder embedding models: ONNX and OpenVINO + optimization & quantization, allowing for speedups up to 2x-3x; a new "n-tuple-scores" output format for hard negative mining for distillation; gathering across devices for a free lunch in multi-GPU training; trackio support; MTEB documentation; and many small fixes and features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.1.0
# Inference only, use one of:
pip install sentence-transformers==5.1.0
pip install sentence-transformers[onnx-gpu]==5.1.0
pip install sentence-transformers[onnx]==5.1.0
pip install sentence-transformers[openvino]==5.1.0
Faster ONNX and OpenVINO backends for SparseEncoder models (#3475)
Introducing a new backend keyword argument to the SparseEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import SparseEncoder
# Load a SparseEncoder model with the ONNX backend
model = SparseEncoder("naver/splade-v3", backend="onnx")
query = "Which planet is known as the Red Planet?"
documents = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# torch.Size([30522]) torch.Size([4, 30522])
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.1450, 26.1040, 22.0025, 23.3877]])
decoded_query = model.decode(query_embeddings, top_k=5)
decoded_documents = model.decode(document_embeddings, top_k=5)
print(decoded_query)
# [('red', 3.0222), ('planet', 2.5001), ('planets', 1.9412), ('known', 1.8126), ('nasa', 0.9347)]
print(decoded_documents)
# [
# [('venus', 3.1980), ('twin', 2.7036), ('earth', 2.4310), ('twins', 2.0957), ('planet', 1.9462)],
# [('mars', 3.1443), ('planet', 2.4924), ('red', 2.4514), ('reddish', 2.2234), ('planets', 2.1976)],
# [('jupiter', 2.9604), ('red', 2.5507), ('planet', 2.3774), ('planets', 2.1641), ('spot', 2.1138)],
# [('saturn', 2.9354), ('red', 2.4548), ('planet', 2.3962), ('mistaken', 2.3361), ('cass', 2.2100)]
# ]
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
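For instance, once the first load has exported the ONNX model, you could persist it like this (a minimal sketch; the local path and repository name are placeholders):
from sentence_transformers import SparseEncoder

# The first load with backend="onnx" exports an ONNX model if the repository doesn't contain one yet
model = SparseEncoder("naver/splade-v3", backend="onnx")

# Persist the exported ONNX file so future loads can reuse it
model.save_pretrained("path/to/splade-v3-onnx")
# model.push_to_hub("your-username/splade-v3-onnx")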
All keyword arguments passed via model_kwargs will be passed on to ORTModelForMaskedLM.from_pretrained or OVModelForMaskedLM.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
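For example (a minimal sketch; the optimized ONNX file name is illustrative and assumes such a file has already been exported to the repository or a local directory):
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "naver/splade-v3",
    backend="onnx",
    model_kwargs={
        "file_name": "onnx/model_O3.onnx",   # illustrative optimized ONNX file
        "provider": "CPUExecutionProvider",  # force CPU execution
    },
)
embeddings = model.encode(["Which planet is known as the Red Planet?"])
print(embeddings.shape)
# (1, 30522)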
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 3 datasets and numerous batch sizes. Here are the findings:
These findings resulted in these recommendations:
For GPU, you can expect 1.81x speedup with bf16 at no cost, and for CPU you can expect up to ~3x speedup at minimal cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
New n-tuple-scores output format from mine_hard_negatives (#3430, #3481)
The mine_hard_negatives utility function has been extended to support the n-tuple-scores output format, which outputs negatives into num_negatives + 3 columns:
- 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
where the 'score' is a list of scores for the query-answer plus each query-negative pair.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
# Mine hard negatives into num_negatives + 3 columns: 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
# where 'score' is a list of scores for the query-answer plus each query-negative pair.
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
num_negatives=5,
sampling_strategy="top",
batch_size=128,
use_faiss=True,
output_format="n-tuple-scores",
)
print(dataset)
print(dataset[14])
"""
{
'query': 'when did jack and the beanstalk take place',
'answer': "Jack and the Beanstalk According to researchers at the universities in Durham and Lisbon, the story originated more than 5,000 years ago, based on a widespread archaic story form which is now classified by folklorists as ATU 328 The Boy Who Stole Ogre's Treasure.[7]",
'negative_1': 'Jack and the Beanstalk "Jack and the Beanstalk" is an English fairy tale. It appeared as "The Story of Jack Spriggins and the Enchanted Bean" in 1734[1] and as Benjamin Tabart\'s moralised "The History of Jack and the Bean-Stalk" in 1807.[2] Henry Cole, publishing under pen name Felix Summerly popularised the tale in The Home Treasury (1845),[3] and Joseph Jacobs rewrote it in English Fairy Tales (1890).[4] Jacobs\' version is most commonly reprinted today and it is believed to be closer to the oral versions than Tabart\'s because it lacks the moralising.[5]',
'negative_2': 'Jack and the Beanstalk Jack climbs the beanstalk twice more. He learns of other treasures and steals them when the giant sleeps: first a goose that lays golden eggs, then a magic harp that plays by itself. The giant wakes when Jack leaves the house with the harp and chases Jack down the beanstalk. Jack calls to his mother for an axe and before the giant reaches the ground, cuts down the beanstalk, causing the giant to fall to his death.',
'negative_3': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas. Food items include a variety of hamburger and cheeseburger sandwiches along with selections of internationally themed foods such as tacos and egg rolls. The company also operates the Qdoba Mexican Grill chain.[4][5]',
'negative_4': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas and the Charlotte metropolitan area. The company also formerly operated the Qdoba Mexican Grill chain until Apollo Global Management bought the chain in December 2017.[4]',
'negative_5': "Jack Box Jack Box (full name Jack I. Box; or simply known as Jack) is the mascot of American restaurant chain Jack in the Box. In the advertisements, he is the founder, CEO, and ad spokesman for the chain. According to the company's web site, he has the appearance of a typical male, with the exception of his huge spherical white head, blue dot eyes, conical black pointed nose, and a curvilinear red smile. He is most of the time seen wearing his yellow clown cap, and a business suit driving a red Viper convertible.",
'score': [0.7949077486991882, 0.8010389804840088, 0.646654963493347...
v5.0.0 - SparseEncoder support; encode_query & encode_document; multi-processing in encode; Router; and more
This release consists of significant updates including the introduction of Sparse Encoder models, new methods encode_query and encode_document, multi-processing support in encode, the Router module for asymmetric models, custom learning rates for parameter groups, composite loss logging, and various small improvements and bug fixes.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.0.0
# Inference only, use one of:
pip install sentence-transformers==5.0.0
pip install sentence-transformers[onnx-gpu]==5.0.0
pip install sentence-transformers[onnx]==5.0.0
pip install sentence-transformers[openvino]==5.0.0
Tip
Our Training and Finetuning Sparse Embedding Models with Sentence Transformers v5 blogpost is an excellent place to learn about finetuning sparse embedding models!
Note
This release is designed to be fully backwards compatible, meaning that you should be able to upgrade from older versions to v5.x without any issues. If you are running into issues when upgrading, feel free to open an issue. Also see the Migration Guide for changes that we would recommend.
Sparse Encoder models
The Sentence Transformers v5.0 release introduces Sparse Embedding models, also known as Sparse Encoders. These models generate high-dimensional embeddings, often with 30,000+ dimensions, of which typically fewer than 1% are non-zero. This is in contrast to standard dense embedding models, which produce low-dimensional embeddings (e.g., 384, 768, or 1024 dimensions) where all values are non-zero.
Usually, each active dimension (i.e. a dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability. This means that you can e.g. see exactly which words/tokens are important in an embedding, and inspect exactly which words/tokens cause two texts to be deemed similar.
Let's have a look at naver/splade-v3, a strong sparse embedding model, as an example:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")
# Run inference
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 32.4323, 5.8528, 0.0258],
# [ 5.8528, 26.6649, 0.0302],
# [ 0.0258, 0.0302, 24.0839]])
# Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
for decoded_sentence, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded_sentence}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]
Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]
Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.
We can even determine the intersection or overlap between embeddings, very useful for determining why two texts are deemed similar or dissimilar:
# Let's also compute the intersection/overlap of the first two embeddings
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]
And if we think the embeddings are too big, we can limit the maximum number of active dimensions like so:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3") # You can also set max_active_dims here instead of encode()
# Run inference
documents = [
"UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again – single words and multiple bullets.",
"Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly — or who experiences a sudden decline — should see his or her doctor.",
"Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
embeddings = model.encode_document(documents, max_active_dims=64)
print(embeddings.shape)
# (3, 30522)
# Print the sparsity of the embeddings
sparsity = model.sparsity(embeddings)
print(sparsity)
# {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}
Click to see that it has minimal impact on scores
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3") # You can also set max_active_dims here instead of encode()
# Run inference
queries = ["what causes aging fast"]
documents = [
"UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again – single words and multiple bullets.",
"Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly — or who experiences a sudden decline — should see his or her doctor.",
"Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
# Determine the sparsity
query_sparsity = model.sparsity(query_embeddings)
document_sparsity = model.sparsity(document_embeddings)
print(query_sparsity, document_sparsity)
# {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 174.6666717529297, 'sparsity_ratio': 0.9942773516888497}
# Calculate the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[11.3767, 10.8296, 4.3457]], device='cuda:0')
# Again with smaller max_active_dims
smaller_document_embeddings = model.encode_document(documents, max_active_dims=64)
# Determine the sparsity for the smaller document embeddings
smaller_document_sparsity = model.sparsity(smaller_document_embeddings)
print(query_sparsity, smaller_document_sparsity)
# {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}
# Print the similarity scores for the smaller document embeddings
smaller_similarities = model.similarity(query_embeddings, smaller_document_embeddings)
print(smaller_similarities)
# tensor([[10.1311, 9.8360, 4.3457]], device='cuda:0')
# Very similar to the scores for the full document embeddings!
Are they any good?
A big question is: How do sparse embedding models stack up against the "standard" dense embedding models, and what kind of performance can you expect when combining the two?
For this, I ran a variation of our hybrid_search.py evaluation script, with:
- The [Nano...
v4.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; improved hard negatives mining
This release introduces 2 new efficient computing backends for CrossEncoder (reranker) models: ONNX and OpenVINO + optimization & quantization, allowing for speedups up to 2x-3x; improved hard negatives mining strategies, and minor improvements.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.1.0
# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0
Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)
Introducing a new backend keyword argument to the CrossEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
For example:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={
"file_name": "model_O3.onnx",
"provider": "CPUExecutionProvider",
}
)
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. Here are the findings:
These findings resulted in these recommendations:
For GPU, you can expect 1.88x speedup with fp16 at no cost, and for CPU you can expect ~3x speedup at no cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
ONNX & OpenVINO Optimization and Quantization
In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:
ONNX Optimization
export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
- model_name_or_path: The directory or model repository where the optimized model will be saved.
- push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
- file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_optimized_onnx_model
onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
model=onnx_model,
optimization_config="O4",
model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
from sentence_transformers import CrossEncoder
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
ONNX Quantization
export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from AutoQuantizationConfig, or a QuantizationConfig instance.
- model_name_or_path: The directory or model repository where the quantized model will be saved.
- push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
- file_suffix: The suffix to add to the quantized model file name. Will use the quantization_config string or e.g. "int8_quantized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
mod...
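The snippet above is cut off; based on the argument list documented above, a complete call might look roughly like this (a sketch; the configuration and repository values are illustrative):
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
    model=model,
    quantization_config="avx512_vnni",  # one of "arm64", "avx2", "avx512", "avx512_vnni"
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)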
v4.0.2 - Safer reranker max sequence length logic, typing issues, FSDP & device placement
This patch release updates some logic for maximum sequence lengths, typing issues, FSDP training, and distributed training device placement.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.0.2
# Inference only, use one of:
pip install sentence-transformers==4.0.2
pip install sentence-transformers[onnx-gpu]==4.0.2
pip install sentence-transformers[onnx]==4.0.2
pip install sentence-transformers[openvino]==4.0.2
Safer CrossEncoder (reranker) maximum sequence length
When loading CrossEncoder models, we now rely on the minimum of the tokenizer model_max_length and the config max_position_embeddings (if they exist), rather than only relying on the latter if it exists. This previously resulted in the maximum sequence length of BAAI/bge-reranker-base being 514, whereas it can only handle sequences up to 512 tokens.
from sentence_transformers import CrossEncoder
model = CrossEncoder("BAAI/bge-reranker-base")
print(model.max_length)
# => 512
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [0.99953485 0.01062613]
# Or test really long inputs to ensure that there's no crash:
score = model.predict([["one " * 1000, "two " * 1000]])
print(score)
# => [0.95482624]
Note that you can use the activation_fn option with torch.nn.Identity() to avoid the default Sigmoid that maps everything to [0, 1]:
from sentence_transformers import CrossEncoder
import torch
model = CrossEncoder("BAAI/bge-reranker-base", activation_fn=torch.nn.Identity())
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [ 7.672551 -4.5337563]
Default device placement (#3303)
By default, in a distributed training setup with multiple CUDA devices, the model is now placed on the CUDA device corresponding with that local rank. This should lower the VRAM usage on GPU 0 when performing distributed training.
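A minimal sketch of what that means in practice (assuming a torchrun launch that sets LOCAL_RANK; the model name is illustrative):
import os
from sentence_transformers import SentenceTransformer

local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.device)
# With this patch, expected to be cuda:<local_rank> rather than always cuda:0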
Minor patches of note
- Resolved typing issues for the SentenceTransformer class outside of the encode method. In v4.0.1, it was no longer possible to get help from your IDE for e.g. model.similarity. (#3297)
- Improve FSDP training compatibility by avoiding a faulty "only if model is wrapped" check. Now, the wrapped model should always be placed in the loss class instance when required for FSDP training. (#3295)
All Changes
- [docs]: update examples by @emmanuel-ferdman in #3292
- Update htaccess, in-line comments were problematic by @tomaarsen in #3293
- [docs] Resolve more broken links throughout the docs by @tomaarsen in #3294
- [docs] Fix some broken docs redirects by @tomaarsen in #3296
- [typing] Move encode typings back to .py from .pyi by @tomaarsen in #3297
- [fix] Avoid "Only if model is wrapped" check which is faulty for FSDP by @tomaarsen in #3295
- [cross-encoder] Set the tokenizer model_max_length to the min. of model_max_length & max_pos_embeds by @tomaarsen in #3304
- [ci] Attempt to fix CI by @tomaarsen in #3305
- Fix device assignment in get_device_name for distributed training by @uminaty in #3303
- [docs] Add missing docstring for push_to_hub by @tomaarsen in #3306
- [docs] Specify that exported ONNX/OpenVINO models don't include pooling/normalization by @tomaarsen in #3307
New Contributors
- @emmanuel-ferdman made their first contribution in #3292
- @uminaty made their first contribution in #3303
Full Changelog: v4.0.1...v4.0.2
v4.0.1 - Reranker (Cross Encoder) Training Refactor; new losses, docs, examples, etc.
This release consists of a major refactor that overhauls the reranker a.k.a. Cross Encoder training approach (introducing multi-gpu training, bf16, loss logging, callbacks, and much more), including all new Training Overview, Loss Overview, API Reference docs, training examples and more!
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.0.1
# Inference only, use one of:
pip install sentence-transformers==4.0.1
pip install sentence-transformers[onnx-gpu]==4.0.1
pip install sentence-transformers[onnx]==4.0.1
pip install sentence-transformers[openvino]==4.0.1
Tip
My Training and Finetuning Reranker Models with Sentence Transformers v4 blogpost is an excellent place to learn 1) why finetuning rerankers makes sense and 2) how you can do it, too!
Reranker (Cross Encoder) training refactor (#3222)
The v4.0 release centers around this huge modernization of the training approach for CrossEncoder models, following v3.0 which introduced the same for SentenceTransformer models. Whereas training before v4.0 used to be all about InputExample, DataLoader and model.fit, the new training approach relies on 5 components. You can learn more about these components in our Training and Finetuning Embedding Models with Sentence Transformers v4 blogpost. Additionally, you can read the new Training Overview, check out the Training Examples, or read this summary:
- Dataset
A training Dataset or DatasetDict. This class is much more suited for sharing & efficient modifications than lists/DataLoaders of InputExample instances. A Dataset can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns. The names of these columns are irrelevant. If there is a "label" or "score" column, it is treated separately, and used as the labels during training.
A DatasetDict can be used to train with multiple datasets at once, e.g.:
DatasetDict({
    natural_questions: Dataset({
        features: ['anchor', 'positive'],
        num_rows: 392702
    })
    gooaq: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 549367
    })
    stsb: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5749
    })
})
When a DatasetDict is used, the loss parameter to the CrossEncoderTrainer must also be a dictionary with these dataset keys, e.g.:
{
    'natural_questions': CachedMultipleNegativesRankingLoss(...),
    'gooaq': CachedMultipleNegativesRankingLoss(...),
    'stsb': BinaryCrossEntropyLoss(...),
}
- Loss Function
A loss function, or a dictionary of loss functions like described above.
- Training Arguments
A CrossEncoderTrainingArguments instance, subclass of a TrainingArguments instance. This powerful class controls the specific details of the training.
- Evaluator
An optional SentenceEvaluator instance. Unlike before, models can now be evaluated both on an evaluation dataset with some loss function and/or a SentenceEvaluator instance.
- Trainer
The new CrossEncoderTrainer instance based on the transformers Trainer. This instance can be initialized with a CrossEncoder model, a CrossEncoderTrainingArguments class, a SentenceEvaluator, a training and evaluation Dataset/DatasetDict and a loss function/dict of loss functions. Most of these parameters are optional. Once provided, all you have to do is call trainer.train().
Some of the major features that are now implemented include:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support
- Loss logging
- Evaluation datasets + evaluation loss
- Improved callback support (built-in via Weights and Biases, TensorBoard, CodeCarbon, etc., as well as custom callbacks)
- Gradient checkpointing
- Gradient accumulation
- Improved model card generation
- Warmup ratio
- Pushing to the Hugging Face Hub on every model checkpoint
- Resuming from a training checkpoint
- Hyperparameter Optimization
This script is a minimal example (no evaluator, no training arguments) of training mpnet-base on a part of the sentence-transformers/hotpotqa dataset using BinaryCrossEntropyLoss:
from datasets import load_dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")
# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/hotpotqa", "triplet", split="train")
def triplet_to_labeled_pair(batch):
    anchors = batch["anchor"]
    positives = batch["positive"]
    negatives = batch["negative"]
    return {
        "sentence_A": anchors * 2,
        "sentence_B": positives + negatives,
        "labels": [1] * len(positives) + [0] * len(negatives),
    }
dataset = dataset.map(triplet_to_labeled_pair, batched=True, remove_columns=dataset.column_names)
train_dataset = dataset.select(range(10_000))
eval_dataset = dataset.select(range(10_000, 11_000))
# 3. Define a loss function
loss = BinaryCrossEntropyLoss(model)
# 4. Create a trainer & train
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
)
trainer.train()
# 5. Save the trained model
model.save_pretrained("models/mpnet-base-hotpotqa")
# model.push_to_hub("mpnet-base-hotpotqa")
Additionally, trained models now automatically produce extensive model cards. Each of the following models were trained using some script from the Training Examples, and the model cards were not edited manually whatsoever:
- tomaarsen/reranker-MiniLM-L12-gooaq-bce
- tomaarsen/reranker-msmarco-MiniLM-L12-H384-uncased-lambdaloss
- tomaarsen/reranker-distilroberta-base-nli
Prior to the Sentence Transformers v4 release, all reranker models would be trained using the CrossEncoder.fit method. Rather than deprecating this method, starting from v4.0, this method will use the CrossEncoderTrainer behind the scenes. This means that your old training code should still work, and should even be upgraded with the new features such as multi-gpu training, loss logging, etc. That said, the new training approach is much more powerful, so it is recommended to write new training scripts using the new approach.
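For reference, a legacy-style script along these lines (a minimal sketch of the pre-v4 API with illustrative data) should still run, now powered by the Trainer under the hood:
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

model = CrossEncoder("microsoft/mpnet-base", num_labels=1)

# Old-style labeled pairs wrapped in InputExample objects (illustrative data)
train_samples = [
    InputExample(texts=["How many people live in Berlin?", "Berlin has about 3.5 million inhabitants."], label=1),
    InputExample(texts=["How many people live in Berlin?", "Berlin is known for its many sports clubs."], label=0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

# CrossEncoder.fit now delegates to the CrossEncoderTrainer behind the scenes
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=1)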
To help you out, all of the Cross Encoder (a.k.a. reranker) training scripts were updated to use the new Trainer-based approach.
Is finetuning worth it?
Finetuning reranker models on your data is very valuable. Consider for example these 2 models that I finetuned on 100k samples from the GooAQ dataset in 30 minutes and 1 hour, respectively. After finetuning, my models heavily outperformed general-purpose reranker models, even though GooAQ is a very generic dataset/domain!
Read my Training and Finetuning Reranker Models with Sentence Transformers v4 blogpost for many more details on these models and how they were trained.
Resources:
- How to use Cross Encoder models? [Cross Encoder > Usage](ht...
v3.4.1 - Model2Vec compatibility & offline model fix
This release introduces a convenient compatibility with Model2Vec models, and fixes a bug that caused an outgoing request even when using a local model.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.4.1
# Inference only, use one of:
pip install sentence-transformers==3.4.1
pip install sentence-transformers[onnx-gpu]==3.4.1
pip install sentence-transformers[onnx]==3.4.1
pip install sentence-transformers[openvino]==3.4.1
Full Model2Vec integration
This release introduces support to load an efficient Model2Vec embedding model directly in Sentence Transformers:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer(
"minishlab/potion-base-8M",
device="cpu",
)
# Run inference
sentences = [
'Gadofosveset-enhanced MR angiography of carotid arteries: does steady-state imaging improve accuracy of first-pass imaging?',
'To evaluate the diagnostic accuracy of gadofosveset-enhanced magnetic resonance (MR) angiography in the assessment of carotid artery stenosis, with digital subtraction angiography (DSA) as the reference standard, and to determine the value of reading first-pass, steady-state, and "combined" (first-pass plus steady-state) MR angiograms.',
'In a longitudinal study we investigated in vivo alterations of CVO during neuroinflammation, applying Gadofluorine M- (Gf) enhanced magnetic resonance imaging (MRI) in experimental autoimmune encephalomyelitis, an animal model of multiple sclerosis. SJL/J mice were monitored by Gadopentate dimeglumine- (Gd-DTPA) and Gf-enhanced MRI after adoptive transfer of proteolipid-protein-specific T cells. Mean Gf intensity ratios were calculated individually for different CVO and correlated to the clinical disease course. Subsequently, the tissue distribution of fluorescence-labeled Gf as well as the extent of cellular inflammation was assessed in corresponding histological slices.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.8085, 0.4884]])
Previously, loading a Model2Vec model required you to load a `StaticEmbedding` module.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
# Download from the 🤗 Hub
module = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[module], device="cpu")
# Run inference
sentences = [
'Gadofosveset-enhanced MR angiography of carotid arteries: does steady-state imaging improve accuracy of first-pass imaging?',
'To evaluate the diagnostic accuracy of gadofosveset-enhanced magnetic resonance (MR) angiography in the assessment of carotid artery stenosis, with digital subtraction angiography (DSA) as the reference standard, and to determine the value of reading first-pass, steady-state, and "combined" (first-pass plus steady-state) MR angiograms.',
'In a longitudinal study we investigated in vivo alterations of CVO during neuroinflammation, applying Gadofluorine M- (Gf) enhanced magnetic resonance imaging (MRI) in experimental autoimmune encephalomyelitis, an animal model of multiple sclerosis. SJL/J mice were monitored by Gadopentate dimeglumine- (Gd-DTPA) and Gf-enhanced MRI after adoptive transfer of proteolipid-protein-specific T cells. Mean Gf intensity ratios were calculated individually for different CVO and correlated to the clinical disease course. Subsequently, the tissue distribution of fluorescence-labeled Gf as well as the extent of cellular inflammation was assessed in corresponding histological slices.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.8085, 0.4884]])
Model2Vec was the inspiration for the recent Static Embedding work; all of these models can be used to approach the performance of normal transformer-based embedding models at a fraction of the latency. For example, both Model2Vec and Static Embedding models are ~25x faster than tiny embedding models on a GPU and ~400x faster than those models on a CPU.
Bug Fix
- Using local_files_only=True still triggered a request to Hugging Face for the model card metadata; this has been resolved in #3202.
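A minimal sketch of the now fully offline flow (assuming the model has already been downloaded or saved locally; the path is a placeholder):
from sentence_transformers import SentenceTransformer

# With the fix, this no longer makes any outgoing request for model card metadata
model = SentenceTransformer("path/to/locally-saved-model", local_files_only=True)
embeddings = model.encode(["This runs fully offline."])
print(embeddings.shape)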
All Changes
- fix loss name in documentation of CachedMultipleNegativesRankingLoss by @JINO-ROHIT in #3191
- Bump jinja2 from 3.1.4 to 3.1.5 in /docs by @dependabot in #3192
- minor typo in MegaBatchMarginLoss by @JINO-ROHIT in #3193
- Fix type hint of StaticEmbedding.__init__ by @altescy in #3196
- [integration] Work towards full model2vec integration by @tomaarsen in #3182
- Don't call set_base_model when local_files_only=True by @Davidyz in #3202
New Contributors
- @dependabot made their first contribution in #3192
- @altescy made their first contribution in #3196
- @Davidyz made their first contribution in #3202
Full Changelog: v3.4.0...v3.4.1
v3.4.0 - Resolved memory leak when deleting a model & trainer; add Matryoshka & Cached loss compatibility; small features & bug fixes
This release resolves a memory leak when deleting a model & trainer, adds compatibility between the Cached... losses and the Matryoshka loss modifier, resolves numerous bugs, and adds several small features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.4.0
# Inference only, use one of:
pip install sentence-transformers==3.4.0
pip install sentence-transformers[onnx-gpu]==3.4.0
pip install sentence-transformers[onnx]==3.4.0
pip install sentence-transformers[openvino]==3.4.0
Matryoshka & Cached loss compatibility (#3068, #3107)
It is now possible to combine the strong Cached losses (CachedMultipleNegativesRankingLoss, CachedGISTEmbedLoss, CachedMultipleNegativesSymmetricRankingLoss) with the Matryoshka loss modifier:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset
model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
"anchor": ["It's nice weather outside today.", "He drove to work."],
"positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
See for example tomaarsen/mpnet-base-gooaq-cmnrl-mrl which was trained with CachedMultipleNegativesRankingLoss (CMNRL) with the Matryoshka loss modifier (MRL).
Resolve memory leak when Model and Trainer are reinitialized (#3144)
Due to a circular dependency in the SentenceTransformerTrainer -> SentenceTransformer -> SentenceTransformerModelCardData -> SentenceTransformerTrainer chain, deleting the trainer and model still didn't free them via garbage collection. I've moved a lot of components around, and now SentenceTransformerModelCardData does not need to store the SentenceTransformerTrainer, breaking the cycle.
We ran the seed optimization script (which frequently creates and deletes models and trainers):
- Before: Approximate highest recorded VRAM: 16332MiB / 24576MiB
- After: Approximate highest recorded VRAM: 8222MiB / 24576MiB
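A minimal sketch of the kind of loop that now releases VRAM between iterations (the dataset, loss, and hyperparameters are illustrative):
import gc

import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

for _ in range(3):
    model = SentenceTransformer("microsoft/mpnet-base")
    trainer = SentenceTransformerTrainer(
        model=model,
        train_dataset=train_dataset,
        loss=losses.MultipleNegativesRankingLoss(model),
    )
    trainer.train()
    # With the circular reference removed, deleting these now actually frees the memory
    del trainer, model
    gc.collect()
    torch.cuda.empty_cache()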
Small Features
- Add Matthews Correlation Coefficient to the BinaryClassificationEvaluator in #3051.
- Add a triplet margin parameter to the TripletEvaluator in #2862.
- Put dataset information in the automatically generated model card in "expanding sections" blocks if there are many datasets in #3088.
- Add multi-GPU (and CPU multi-process) support for mine_hard_negatives in #2967.
Notable Bug Fixes
- Subsequent batches were identical when using the no_duplicates Batch Sampler (#3069). This has been resolved in #3073.
- The old-style model.fit() training with write_csv on an evaluator would crash (#3062). This has been resolved in #3066.
- The output types of some evaluators were np.float instead of float (#3075). This has been resolved in #3076 and #3096.
- It was not possible to specify a revision or cache_dir when loading a PEFT Adapter model (#3061). This has been resolved in #3079 and #3174.
- The CrossEncoder was lazily placed on the incorrect device and did not respond to model.to (#3078). This has been resolved in #3104.
- If a model used a custom module with custom kwargs, those kwargs keys were not saved in modules.json correctly, e.g. relevant for jina-embeddings-v3 (#3111). This has been resolved in #3112.
- HfArgumentParser(SentenceTransformerTrainingArguments) would crash due to prompts typing (#3090). This has been resolved in #3178.
Example Updates
- Update the quantization script in #3070.
- Update the seed optimization script in #3092.
- Update the TSDAE scripts in #3137.
- Add PEFT Adapter script in #3180.
Documentation Updates
- Add PEFT Adapter documentation in #3180.
- Add links to backend-export in Speeding up Inference.
All Changes
- [training] Pass steps/epoch/output_path to Evaluator during training by @tomaarsen in #3066
- [examples] Update the quantization script by @tomaarsen in #3070
- [fix] Fix different batches per epoch in NoDuplicatesBatchSampler by @tomaarsen in #3073
- [docs] Add links to backend-export in Speeding up Inference by @tomaarsen in #3071
- add MCC to BinaryClassificationEvaluator by @JINO-ROHIT in #3051
- support cached losses in combination with matryoshka loss by @Marcel256 in #3068
- align model_card_templates.py with code by @amitport in #3081
- converting np float result to float in binary classification evaluator by @JINO-ROHIT in #3076
- Add triplet margin for distance functions in TripletEvaluator by @zivicmilos in #2862
- [model_card] Keep the model card readable even with many datasets by @tomaarsen in #3088
- [docs] Add NanoBEIR to the Training Overview evaluators by @tomaarsen in #3089
- [fix] revision of the adapter model can now be specified. by @pesuchin in #3079
- [docs] Update from Sphinx==3.5.4 to 8.1.3, recommonmark -> myst-parser by @tomaarsen in #3099
- normalize to float in NanoBEIREvaluator, InformationRetrievalEvaluator, MSEEvaluator by @JINO-ROHIT in #3096
- [docs] List 'prompts' as a key training argument by @tomaarsen in #3101
- revert float type cast manually in BinaryClassificationEvaluator by @JINO-ROHIT in #3102
- update train_sts_seed_optimization with SentenceTransformerTrainer by @JINO-ROHIT in #3092
- Fix cross encoder device issue by @susnato in #3104
- [enhancement] Make MultipleNegativesRankingLoss easier to understand by @tomaarsen in #3100
- [fix] Fix breaking change in PyLate when loading modules by @tomaarsen in #3110
- multi-GPU support for mine_hard_negatives by @alperctnkaya in #2967
- raises error when dataset is an empty list in NanoBEIREvaluator by @JINO-ROHIT in #3122
- Added a note to the documentation stating that the similarity method does not support embeddings other than non-quantized ones. by @pesuchin in #3131
- [typo] Add missing space between sentences in error message by @tomaarsen in #3125
- raises ValueError when num_label !=1 when using Crossencoder.rank() by @JINO-ROHIT in #3126
- fix backward pass for cached losses by @Marcel256 in #3114
- Adding evaluation checks to prevent Transformer ValueError by @stsfaroz in #3105
- [typo] Fix incorrect spelling for "corpus" by @ignasgr in #3154
- [fix] Save custom module kwargs if specified by @tomaarsen in #3112
- [memory] Avoid storing trainer in ModelCardCallback and SentenceTransformerModelCardData by @tomaarsen in #3144
- Suport for embedded representation by @Radu1999 in #3156
- [DRAFT] tests for nanobeir evaluator by @JINO-ROHIT in #3127
- Update TSDAE examples with SentenceTransformerTrainer by @JINO-ROHIT in #3137
- [docs] Update the Static Embedding example snippet by @tomaarsen in #3177
- fix: propagate cache dir to find adapter by @lauralehoczki11 in #3174
- [fix] Use HfArgumentParser-compatible typing for prompts by @tomaarsen in #3178
- testcases for community detection by @JINO-ROHIT in #3163
...
v3.3.1 - Patch private model loading without environment variable
This patch release fixes a small issue with loading private models from Hugging Face using the token argument.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.3.1
# Inference only, use one of:
pip install sentence-transformers==3.3.1
pip install sentence-transformers[onnx-gpu]==3.3.1
pip install sentence-transformers[onnx]==3.3.1
pip install sentence-transformers[openvino]==3.3.1
Details
If you're loading a model under this scenario:
- Your model is hosted on Hugging Face.
- Your model is private.
- You haven't set the HF_TOKEN environment variable via huggingface-cli login or some other approach.
- You're passing the token argument to SentenceTransformer to load the model.
Then you may have encountered a crash in v3.3.0. This should be resolved now.
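A minimal sketch of that scenario (the repository name and token value are placeholders):
from sentence_transformers import SentenceTransformer

# Loading a private model by passing the token directly, without HF_TOKEN being set
model = SentenceTransformer("your-username/your-private-model", token="hf_...")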
All Changes
- [docs] Fix the prompt link to the training script by @tomaarsen in #3060
- [Fix] Resolve loading private Transformer model in version 3.3.0 by @pesuchin in #3058
Full Changelog: v3.3.0...v3.3.1
v3.3.0 - Massive CPU speedup with OpenVINO int8 quantization; Training with Prompts for stronger models; NanoBEIR IR evaluation; PEFT compatibility; Transformers v4.46.0 compatibility
4x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free performance boost, convenient evaluation on NanoBEIR: a subset of a strong Information Retrieval benchmark, PEFT compatibility by easily adding/loading adapters, Transformers v4.46.0 compatibility, and Python 3.8 deprecation.
Install this version with:
# Training + Inference
pip install sentence-transformers[train]==3.3.0
# Inference only, use one of:
pip install sentence-transformers==3.3.0
pip install sentence-transformers[onnx-gpu]==3.3.0
pip install sentence-transformers[onnx]==3.3.0
pip install sentence-transformers[openvino]==3.3.0
OpenVINO int8 static quantization (#3025)
We introduce int8 static quantization using OpenVINO, a highly performant solution that outperforms all other current backends by a mile, at a minimal loss in performance. Here are the updated benchmarks:
Quantizing directly to the Hugging Face Hub
from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
# 1. Load a model with the OpenVINO backend
model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")
# 2. Quantize the model to int8, push the model to https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# as a pull request:
export_static_quantized_openvino_model(
model,
quantization_config=None,
model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
You can immediately use the model, even before it's merged, by using the revision argument:
from sentence_transformers import SentenceTransformer
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = SentenceTransformer(
"all-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino_model_qint8_quantized.xml"},
revision=f"refs/pr/{pull_request_nr}"
)
And once it's merged:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"all-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
Quantizing locally
You can also quantize a model and save it locally:
from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig
model = SentenceTransformer("all-mpnet-base-v2", backend="openvino")
model.save_pretrained("path/to/all-mpnet-base-v2-local")
quantization_config = OVQuantizationConfig() # <- You can update settings here
export_static_quantized_openvino_model(model, quantization_config, "path/to/all-mpnet-base-v2-local")
And after quantizing, you can load it like so:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"path/to/all-mpnet-base-v2-local",
backend="openvino",
model_kwargs={"file_name": "openvino_model_qint8_quantized.xml"},
)
All original Sentence Transformer models already have these new openvino_model_qint8_quantized.xml files, so you can load them without exporting directly! I would recommend making pull requests for other models on Hugging Face that you'd like to see quantized.
Learn more about how to Speed up Inference in the documentation: https://sbert.net/docs/sentence_transformer/usage/efficiency.html
Training with Prompts (#2964)
Many modern embedding models are trained with "instructions" or "prompts" following the INSTRUCTOR paper. These prompts are strings, prefixed to each text to be embedded, allowing the model to distinguish between different types of text.
For example, the mixedbread-ai/mxbai-embed-large-v1 model was trained with "Represent this sentence for searching relevant passages: " as the prompt for all queries. This prompt is stored in the model configuration under the prompt name "query", so users can specify that prompt_name in model.encode:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedding = model.encode("What are Pandas?", prompt_name="query")
# or
# query_embedding = model.encode("What are Pandas?", prompt="Represent this sentence for searching relevant passages: ")
document_embeddings = model.encode([
"Pandas is a software library written for the Python programming language for data manipulation and analysis.",
"Pandas are a species of bear native to South Central China. They are also known as the giant panda or simply panda.",
"Koala bears are not actually bears, they are marsupials native to Australia.",
])
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# => tensor([[0.7594, 0.7560, 0.4674]])
Various papers (INSTRUCTOR, BGE) show that including prompts or instructions both during training and inference results in stronger performance. As of this release, it's now possible to easily train with prompts in Sentence Transformers with just one extra training argument: prompts. There are 4 accepted formats for it:
1. str: A single prompt to use for all columns in all datasets. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts="text: ",
    ...,
)
2. Dict[str, str]: A dictionary mapping column names to prompts, applied to all datasets. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "query": "query: ",
        "answer": "document: ",
    },
    ...,
)
3. Dict[str, str]: A dictionary mapping dataset names to prompts. This should only be used if your training/evaluation/test datasets are a DatasetDict or a dictionary of Dataset. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "stsb": "Represent this text for semantic similarity search: ",
        "nq": "Represent this text for retrieval: ",
    },
    ...,
)
4. Dict[str, Dict[str, str]]: A dictionary mapping dataset names to dictionaries mapping column names to prompts. This should only be used if your training/evaluation/test datasets are a DatasetDict or a dictionary of Dataset. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "stsb": {
            "sentence1": "sts: ",
            "sentence2": "sts: ",
        },
        "nq": {
            "query": "query: ",
            "document": "document: ",
        },
    },
    ...,
)
I've trained models with and without prompts for 2 base models: mpnet-base and bert-base-uncased:
- tomaarsen/mpnet-base-nq
- tomaarsen/mpnet-base-nq-prompts
- tomaarsen/bert-base-nq
- tomaarsen/bert-base-nq-prompts
For both base models, the model with prompts consistently outperformed the baseline model. After training, the models with prompts resulted in a 0.66% and 0.90% relative improvement on NDCG@10 at no extra cost.
(Result figures: mpnet-base tests and bert-base-uncased tests)
- Training with Prompts documentation: https://sbert.net/examples/training/prompts/README.html
- Training with Prompts training script: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/prompts/training_nq_prompts.py
NanoBEIR Evaluator integration (#2966)
This update introduced a new simple NanoBEIREvaluator, evaluating your model against NanoBEIR: a collection of subsets of the 13 BEIR datasets. BEIR corresponds to the retrieval tab of MTEB, and is commonly seen as a valuable indicator of general-purpose information retrieval performance.
With the NanoBEIREvaluator, you can easily evaluate your models on a much faster benchmark that should give similar insights in performance...
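A minimal usage sketch (assuming the default configuration, which evaluates on every NanoBEIR dataset):
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = NanoBEIREvaluator()  # defaults to the full set of NanoBEIR datasets
results = evaluator(model)
print(evaluator.primary_metric)
print(results[evaluator.primary_metric])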
v3.2.1 - Patch CLIP loading, small ONNX fix, compatibility with other libraries
This patch release fixes some small bugs, such as related to loading CLIP models, automatic model card generation issues, and ensuring compatibility with third party libraries.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.2.1
# Inference only, use one of:
pip install sentence-transformers==3.2.1
pip install sentence-transformers[onnx-gpu]==3.2.1
pip install sentence-transformers[onnx]==3.2.1
pip install sentence-transformers[openvino]==3.2.1
Fix loading non-Transformer models
In v3.2.0, a non-Transformer based model (e.g. CLIP) would not load correctly if the model was saved in the root of the model repository/directory. This has been resolved in #3007.
Throw error if StaticEmbedding-based model is finetuned with incompatible losses
The following losses are not compatible with StaticEmbedding-based models:
- CachedGISTEmbedLoss
- CachedMultipleNegativesRankingLoss
- CachedMultipleNegativesSymmetricRankingLoss
- DenoisingAutoEncoderLoss
- GISTEmbedLoss
An error is now thrown when one of these is used with a StaticEmbedding-based model. I recommend using MultipleNegativesRankingLoss to finetune these models, e.g. as in https://huggingface.co/tomaarsen/static-bert-uncased-gooaq.
Note: to get good performance, you must use much higher learning rates than otherwise. In my experiments, 2e-1 worked well.
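For example, a sketch of training arguments reflecting that note (values other than the learning rate are illustrative):
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/static-embedding-gooaq",  # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=2048,  # static models are cheap, so large batches are feasible
    learning_rate=2e-1,  # much higher than the ~2e-5 typically used for transformer models
)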
Patch ONNX model when the model uses output_hidden_states
For example, this script used to fail, but passes now:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"distiluse-base-multilingual-cased",
backend="onnx",
model_kwargs={"provider": "CPUExecutionProvider"},
)
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)
All changes
- Bump optimum version by @echarlaix in #2984
- [docs] Update the training snippets for some losses that should use the v3 Trainer by @tomaarsen in #2987
- [enh] Throw error if StaticEmbedding-based model is trained with incompatible loss by @tomaarsen in #2990
- [fix] Fix semantic_search_usearch with 'binary' by @tomaarsen in #2989
- [enh] Add support for large_string in model card create by @yaohwang in #2999
- [model cards] Prevent crash on generating widgets if dataset column is empty by @tomaarsen in #2997
- [fix] Added model2vec import compatible with current and newer version by @Pringled in #2992
- Fix cache_dir issue with loading CLIPModel by @BoPeng in #3007
- [warn] Throw a warning if compute_metrics is set, as it's not used by @tomaarsen in #3002
- [fix] Prevent IndexError if output_hidden_states & ONNX by @tomaarsen in #3008
New Contributors
- @echarlaix made their first contribution in #2984
- @yaohwang made their first contribution in #2999
- @Pringled made their first contribution in #2992
- @BoPeng made their first contribution in #3007
Full Changelog: v3.2.0...v3.2.1