Releases: UKPLab/sentence-transformers
v5.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; more hard negatives mining formats
This release introduces 2 new efficient computing backends for SparseEncoder embedding models: ONNX and OpenVINO + optimization & quantization, allowing for speedups up to 2x-3x; a new "n-tuple-scores" output format for hard negative mining for distillation; gathering across devices for a free lunch in multi-GPU training; trackio support; MTEB documentation; and many small fixes and features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.1.0
# Inference only, use one of:
pip install sentence-transformers==5.1.0
pip install sentence-transformers[onnx-gpu]==5.1.0
pip install sentence-transformers[onnx]==5.1.0
pip install sentence-transformers[openvino]==5.1.0
Faster ONNX and OpenVINO backends for SparseEncoder models (#3475)
Introducing a new backend keyword argument to the SparseEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import SparseEncoder
# Load a SparseEncoder model with the ONNX backend
model = SparseEncoder("naver/splade-v3", backend="onnx")
query = "Which planet is known as the Red Planet?"
documents = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# torch.Size([30522]) torch.Size([4, 30522])
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.1450, 26.1040, 22.0025, 23.3877]])
decoded_query = model.decode(query_embeddings, top_k=5)
decoded_documents = model.decode(document_embeddings, top_k=5)
print(decoded_query)
# [('red', 3.0222), ('planet', 2.5001), ('planets', 1.9412), ('known', 1.8126), ('nasa', 0.9347)]
print(decoded_documents)
# [
# [('venus', 3.1980), ('twin', 2.7036), ('earth', 2.4310), ('twins', 2.0957), ('planet', 1.9462)],
# [('mars', 3.1443), ('planet', 2.4924), ('red', 2.4514), ('reddish', 2.2234), ('planets', 2.1976)],
# [('jupiter', 2.9604), ('red', 2.5507), ('planet', 2.3774), ('planets', 2.1641), ('spot', 2.1138)],
# [('saturn', 2.9354), ('red', 2.4548), ('planet', 2.3962), ('mistaken', 2.3361), ('cass', 2.2100)]
# ]
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
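For instance, once the first load has exported the ONNX model, you could persist it like this (a minimal sketch; the local path and repository name are placeholders):
from sentence_transformers import SparseEncoder

# The first load with backend="onnx" exports an ONNX model if the repository doesn't contain one yet
model = SparseEncoder("naver/splade-v3", backend="onnx")

# Persist the exported ONNX file so future loads can reuse it
model.save_pretrained("path/to/splade-v3-onnx")
# model.push_to_hub("your-username/splade-v3-onnx")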
All keyword arguments passed via model_kwargs will be passed on to ORTModelForMaskedLM.from_pretrained or OVModelForMaskedLM.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
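For example (a minimal sketch; the optimized ONNX file name is illustrative and assumes such a file has already been exported to the repository or a local directory):
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "naver/splade-v3",
    backend="onnx",
    model_kwargs={
        "file_name": "onnx/model_O3.onnx",   # illustrative optimized ONNX file
        "provider": "CPUExecutionProvider",  # force CPU execution
    },
)
embeddings = model.encode(["Which planet is known as the Red Planet?"])
print(embeddings.shape)
# (1, 30522)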
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 3 datasets and numerous batch sizes. Here are the findings:
These findings resulted in these recommendations:
For GPU, you can expect 1.81x speedup with bf16 at no cost, and for CPU you can expect up to ~3x speedup at minimal cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
New n-tuple-scores output format from mine_hard_negatives (#3430, #3481)
The mine_hard_negatives utility function has been extended to support the n-tuple-scores output format, which outputs negatives into num_negatives + 3 columns:
- 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
where the 'score' is a list of scores for the query-answer plus each query-negative pair.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
# Mine hard negatives into num_negatives + 3 columns: 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
# where 'score' is a list of scores for the query-answer plus each query-negative pair.
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
num_negatives=5,
sampling_strategy="top",
batch_size=128,
use_faiss=True,
output_format="n-tuple-scores",
)
print(dataset)
print(dataset[14])
"""
{
'query': 'when did jack and the beanstalk take place',
'answer': "Jack and the Beanstalk According to researchers at the universities in Durham and Lisbon, the story originated more than 5,000 years ago, based on a widespread archaic story form which is now classified by folklorists as ATU 328 The Boy Who Stole Ogre's Treasure.[7]",
'negative_1': 'Jack and the Beanstalk "Jack and the Beanstalk" is an English fairy tale. It appeared as "The Story of Jack Spriggins and the Enchanted Bean" in 1734[1] and as Benjamin Tabart\'s moralised "The History of Jack and the Bean-Stalk" in 1807.[2] Henry Cole, publishing under pen name Felix Summerly popularised the tale in The Home Treasury (1845),[3] and Joseph Jacobs rewrote it in English Fairy Tales (1890).[4] Jacobs\' version is most commonly reprinted today and it is believed to be closer to the oral versions than Tabart\'s because it lacks the moralising.[5]',
'negative_2': 'Jack and the Beanstalk Jack climbs the beanstalk twice more. He learns of other treasures and steals them when the giant sleeps: first a goose that lays golden eggs, then a magic harp that plays by itself. The giant wakes when Jack leaves the house with the harp and chases Jack down the beanstalk. Jack calls to his mother for an axe and before the giant reaches the ground, cuts down the beanstalk, causing the giant to fall to his death.',
'negative_3': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas. Food items include a variety of hamburger and cheeseburger sandwiches along with selections of internationally themed foods such as tacos and egg rolls. The company also operates the Qdoba Mexican Grill chain.[4][5]',
'negative_4': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas and the Charlotte metropolitan area. The company also formerly operated the Qdoba Mexican Grill chain until Apollo Global Management bought the chain in December 2017.[4]',
'negative_5': "Jack Box Jack Box (full name Jack I. Box; or simply known as Jack) is the mascot of American restaurant chain Jack in the Box. In the advertisements, he is the founder, CEO, and ad spokesman for the chain. According to the company's web site, he has the appearance of a typical male, with the exception of his huge spherical white head, blue dot eyes, conical black pointed nose, and a curvilinear red smile. He is most of the time seen wearing his yellow clown cap, and a business suit driving a red Viper convertible.",
'score': [0.7949077486991882, 0.8010389804840088, 0.646654963493347...
v5.0.0 - SparseEncoder support; encode_query & encode_document; multi-processing in encode; Router; and more
This release consists of significant updates including the introduction of Sparse Encoder models, new methods encode_query and encode_document, multi-processing support in encode, the Router module for asymmetric models, custom learning rates for parameter groups, composite loss logging, and various small improvements and bug fixes.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.0.0
# Inference only, use one of:
pip install sentence-transformers==5.0.0
pip install sentence-transformers[onnx-gpu]==5.0.0
pip install sentence-transformers[onnx]==5.0.0
pip install sentence-transformers[openvino]==5.0.0
Tip
Our Training and Finetuning Sparse Embedding Models with Sentence Transformers v5 blogpost is an excellent place to learn about finetuning sparse embedding models!
Note
This release is designed to be fully backwards compatible, meaning that you should be able to upgrade from older versions to v5.x without any issues. If you are running into issues when upgrading, feel free to open an issue. Also see the Migration Guide for changes that we would recommend.
Sparse Encoder models
The Sentence Transformers v5.0 release introduces Sparse Embedding models, also known as Sparse Encoders. These models generate high-dimensional embeddings, often with 30,000+ dimensions, of which typically fewer than 1% are non-zero. This is in contrast to standard dense embedding models, which produce low-dimensional embeddings (e.g., 384, 768, or 1024 dimensions) where all values are non-zero.
Usually, each active dimension (i.e. a dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability. This means that you can e.g. see exactly which words/tokens are important in an embedding, and inspect exactly which words/tokens cause two texts to be deemed similar.
Let's have a look at naver/splade-v3, a strong sparse embedding model, as an example:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")
# Run inference
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 32.4323, 5.8528, 0.0258],
# [ 5.8528, 26.6649, 0.0302],
# [ 0.0258, 0.0302, 24.0839]])
# Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
for decoded_sentence, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded_sentence}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]
Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]
Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.
We can even determine the intersection or overlap between embeddings, very useful for determining why two texts are deemed similar or dissimilar:
# Let's also compute the intersection/overlap of the first two embeddings
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]
And if we think the embeddings are too big, we can limit the maximum number of active dimensions like so:
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3") # You can also set max_active_dims here instead of encode()
# Run inference
documents = [
"UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again – single words and multiple bullets.",
"Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly — or who experiences a sudden decline — should see his or her doctor.",
"Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
embeddings = model.encode_document(documents, max_active_dims=64)
print(embeddings.shape)
# (3, 30522)
# Print the sparsity of the embeddings
sparsity = model.sparsity(embeddings)
print(sparsity)
# {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}
Click to see that it has minimal impact on scores
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3") # You can also set max_active_dims here instead of encode()
# Run inference
queries = ["what causes aging fast"]
documents = [
"UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again – single words and multiple bullets.",
"Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly — or who experiences a sudden decline — should see his or her doctor.",
"Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
# Determine the sparsity
query_sparsity = model.sparsity(query_embeddings)
document_sparsity = model.sparsity(document_embeddings)
print(query_sparsity, document_sparsity)
# {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 174.6666717529297, 'sparsity_ratio': 0.9942773516888497}
# Calculate the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[11.3767, 10.8296, 4.3457]], device='cuda:0')
# Again with smaller max_active_dims
smaller_document_embeddings = model.encode_document(documents, max_active_dims=64)
# Determine the sparsity for the smaller document embeddings
smaller_document_sparsity = model.sparsity(smaller_document_embeddings)
print(query_sparsity, smaller_document_sparsity)
# {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}
# Print the similarity scores for the smaller document embeddings
smaller_similarities = model.similarity(query_embeddings, smaller_document_embeddings)
print(smaller_similarities)
# tensor([[10.1311, 9.8360, 4.3457]], device='cuda:0')
# Very similar to the scores for the full document embeddings!
Are they any good?
A big question is: How do sparse embedding models stack up against the "standard" dense embedding models, and what kind of performance can you expect when combining the two?
For this, I ran a variation of our hybrid_search.py evaluation script, with:
- The [Nano...
v4.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; improved hard negatives mining
This release introduces 2 new efficient computing backends for CrossEncoder (reranker) models: ONNX and OpenVINO + optimization & quantization, allowing for speedups up to 2x-3x; improved hard negatives mining strategies, and minor improvements.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.1.0
# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0
Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)
Introducing a new backend keyword argument to the CrossEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
For example:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={
"file_name": "model_O3.onnx",
"provider": "CPUExecutionProvider",
}
)
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. Here are the findings:
These findings resulted in these recommendations:
For GPU, you can expect 1.88x speedup with fp16 at no cost, and for CPU you can expect ~3x speedup at no cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
ONNX & OpenVINO Optimization and Quantization
In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:
ONNX Optimization
export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
- model_name_or_path: The directory or model repository where the optimized model will be saved.
- push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
- file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_optimized_onnx_model
onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
model=onnx_model,
optimization_config="O4",
model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
from sentence_transformers import CrossEncoder
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
ONNX Quantization
export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from AutoQuantizationConfig, or a QuantizationConfig instance.
- model_name_or_path: The directory or model repository where the quantized model will be saved.
- push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
- file_suffix: The suffix to add to the quantized model file name. Will use the quantization_config string or e.g. "int8_quantized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
mod...
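The snippet above is cut off; based on the argument list documented above, a complete call might look roughly like this (a sketch; the configuration and repository values are illustrative):
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
    model=model,
    quantization_config="avx512_vnni",  # one of "arm64", "avx2", "avx512", "avx512_vnni"
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)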
v4.0.2 - Safer reranker max sequence length logic, typing issues, FSDP & device placement
This patch release updates some logic for maximum sequence lengths, typing issues, FSDP training, and distributed training device placement.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.0.2
# Inference only, use one of:
pip install sentence-transformers==4.0.2
pip install sentence-transformers[onnx-gpu]==4.0.2
pip install sentence-transformers[onnx]==4.0.2
pip install sentence-transformers[openvino]==4.0.2
Safer CrossEncoder (reranker) maximum sequence length
When loading CrossEncoder models, we now rely on the minimum of the tokenizer model_max_length and the config max_position_embeddings (if they exist), rather than only relying on the latter if it exists. This previously resulted in the maximum sequence length of BAAI/bge-reranker-base being 514, whereas it can only handle sequences up to 512 tokens.
from sentence_transformers import CrossEncoder
model = CrossEncoder("BAAI/bge-reranker-base")
print(model.max_length)
# => 512
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [0.99953485 0.01062613]
# Or test really long inputs to ensure that there's no crash:
score = model.predict([["one " * 1000, "two " * 1000]])
print(score)
# => [0.95482624]
Note that you can use the activation_fn option with torch.nn.Identity() to avoid the default Sigmoid that maps everything to [0, 1]:
from sentence_transformers import CrossEncoder
import torch
model = CrossEncoder("BAAI/bge-reranker-base", activation_fn=torch.nn.Identity())
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [ 7.672551 -4.5337563]
Default device placement (#3303)
By default, in a distributed training setup with multiple CUDA devices, the model is now placed on the CUDA device corresponding with that local rank. This should lower the VRAM usage on GPU 0 when performing distributed training.
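A minimal sketch of what that means in practice (assuming a torchrun launch that sets LOCAL_RANK; the model name is illustrative):
import os
from sentence_transformers import SentenceTransformer

local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.device)
# With this patch, expected to be cuda:<local_rank> rather than always cuda:0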
Minor patches of note
- Resolved typing issues for the SentenceTransformer class outside of the encode method. In v4.0.1, it was no longer possible to get help from your IDE for e.g. model.similarity. (#3297)
- Improve FSDP training compatibility by avoiding a faulty "only if model is wrapped" check. Now, the wrapped model should always be placed in the loss class instance when required for FSDP training. (#3295)
All Changes
- [docs]: update examples by @emmanuel-ferdman in #3292
- Update htaccess, in-line comments were problematic by @tomaarsen in #3293
- [docs] Resolve more broken links throughout the docs by @tomaarsen in #3294
- [docs] Fix some broken docs redirects by @tomaarsen in #3296
- [typing] Move encode typings back to .py from .pyi by @tomaarsen in #3297
- [fix] Avoid "Only if model is wrapped" check which is faulty for FSDP by @tomaarsen in #3295
- [cross-encoder] Set the tokenizer model_max_length to the min. of model_max_length & max_pos_embeds by @tomaarsen in #3304
- [ci] Attempt to fix CI by @tomaarsen in #3305
- Fix device assignment in get_device_name for distributed training by @uminaty in #3303
- [docs] Add missing docstring for push_to_hub by @tomaarsen in #3306
- [docs] Specify that exported ONNX/OpenVINO models don't include pooling/normalization by @tomaarsen in #3307
New Contributors
- @emmanuel-ferdman made their first contribution in #3292
- @uminaty made their first contribution in #3303
Full Changelog: v4.0.1...v4.0.2
v4.0.1 - Reranker (Cross Encoder) Training Refactor; new losses, docs, examples, etc.
This release consists of a major refactor that overhauls the reranker a.k.a. Cross Encoder training approach (introducing multi-gpu training, bf16, loss logging, callbacks, and much more), including all new Training Overview, Loss Overview, API Reference docs, training examples and more!
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.0.1
# Inference only, use one of:
pip install sentence-transformers==4.0.1
pip install sentence-transformers[onnx-gpu]==4.0.1
pip install sentence-transformers[onnx]==4.0.1
pip install sentence-transformers[openvino]==4.0.1
Tip
My Training and Finetuning Reranker Models with Sentence Transformers v4 blogpost is an excellent place to learn 1) why finetuning rerankers makes sense and 2) how you can do it, too!
Reranker (Cross Encoder) training refactor (#3222)
The v4.0 release centers around this huge modernization of the training approach for CrossEncoder models, following v3.0 which introduced the same for SentenceTransformer models. Whereas training before v4.0 used to be all about InputExample, DataLoader and model.fit, the new training approach relies on 5 components. You can learn more about these components in our Training and Finetuning Embedding Models with Sentence Transformers v4 blogpost. Additionally, you can read the new Training Overview, check out the Training Examples, or read this summary:
- Dataset
A training Dataset or DatasetDict. This class is much more suited for sharing & efficient modifications than lists/DataLoaders of InputExample instances. A Dataset can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns. The names of these columns are irrelevant. If there is a "label" or "score" column, it is treated separately, and used as the labels during training.
A DatasetDict can be used to train with multiple datasets at once, e.g.:
DatasetDict({
    natural_questions: Dataset({
        features: ['anchor', 'positive'],
        num_rows: 392702
    })
    gooaq: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 549367
    })
    stsb: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5749
    })
})
When a DatasetDict is used, the loss parameter to the CrossEncoderTrainer must also be a dictionary with these dataset keys, e.g.:
{
    'natural_questions': CachedMultipleNegativesRankingLoss(...),
    'gooaq': CachedMultipleNegativesRankingLoss(...),
    'stsb': BinaryCrossEntropyLoss(...),
}
- Loss Function
A loss function, or a dictionary of loss functions like described above.
- Training Arguments
A CrossEncoderTrainingArguments instance, subclass of a TrainingArguments instance. This powerful class controls the specific details of the training.
- Evaluator
An optional SentenceEvaluator instance. Unlike before, models can now be evaluated both on an evaluation dataset with some loss function and/or a SentenceEvaluator instance.
- Trainer
The new CrossEncoderTrainer instance based on the transformers Trainer. This instance can be initialized with a CrossEncoder model, a CrossEncoderTrainingArguments class, a SentenceEvaluator, a training and evaluation Dataset/DatasetDict and a loss function/dict of loss functions. Most of these parameters are optional. Once provided, all you have to do is call trainer.train().
Some of the major features that are now implemented include:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support
- Loss logging
- Evaluation datasets + evaluation loss
- Improved callback support (built-in via Weights and Biases, TensorBoard, CodeCarbon, etc., as well as custom callbacks)
- Gradient checkpointing
- Gradient accumulation
- Improved model card generation
- Warmup ratio
- Pushing to the Hugging Face Hub on every model checkpoint
- Resuming from a training checkpoint
- Hyperparameter Optimization
This script is a minimal example (no evaluator, no training arguments) of training mpnet-base on a part of the sentence-transformers/hotpotqa dataset using BinaryCrossEntropyLoss:
from datasets import load_dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
# 1. Define the model. Either from scratch or by loading a pre-trained model
model = CrossEncoder("microsoft/mpnet-base")
# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/hotpotqa", "triplet", split="train")
def triplet_to_labeled_pair(batch):
    anchors = batch["anchor"]
    positives = batch["positive"]
    negatives = batch["negative"]
    return {
        "sentence_A": anchors * 2,
        "sentence_B": positives + negatives,
        "labels": [1] * len(positives) + [0] * len(negatives),
    }
dataset = dataset.map(triplet_to_labeled_pair, batched=True, remove_columns=dataset.column_names)
train_dataset = dataset.select(range(10_000))
eval_dataset = dataset.select(range(10_000, 11_000))
# 3. Define a loss function
loss = BinaryCrossEntropyLoss(model)
# 4. Create a trainer & train
trainer = CrossEncoderTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
loss=loss,
)
trainer.train()
# 5. Save the trained model
model.save_pretrained("models/mpnet-base-hotpotqa")
# model.push_to_hub("mpnet-base-hotpotqa")
Additionally, trained models now automatically produce extensive model cards. Each of the following models were trained using some script from the Training Examples, and the model cards were not edited manually whatsoever:
- tomaarsen/reranker-MiniLM-L12-gooaq-bce
- tomaarsen/reranker-msmarco-MiniLM-L12-H384-uncased-lambdaloss
- tomaarsen/reranker-distilroberta-base-nli
Prior to the Sentence Transformers v4 release, all reranker models would be trained using the CrossEncoder.fit method. Rather than deprecating this method, starting from v4.0, this method will use the CrossEncoderTrainer behind the scenes. This means that your old training code should still work, and should even be upgraded with the new features such as multi-gpu training, loss logging, etc. That said, the new training approach is much more powerful, so it is recommended to write new training scripts using the new approach.
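For reference, a legacy-style script along these lines (a minimal sketch of the pre-v4 API with illustrative data) should still run, now powered by the Trainer under the hood:
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

model = CrossEncoder("microsoft/mpnet-base", num_labels=1)

# Old-style labeled pairs wrapped in InputExample objects (illustrative data)
train_samples = [
    InputExample(texts=["How many people live in Berlin?", "Berlin has about 3.5 million inhabitants."], label=1),
    InputExample(texts=["How many people live in Berlin?", "Berlin is known for its many sports clubs."], label=0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

# CrossEncoder.fit now delegates to the CrossEncoderTrainer behind the scenes
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=1)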
To help you out, all of the Cross Encoder (a.k.a. reranker) training scripts were updated to use the new Trainer-based approach.
Is finetuning worth it?
Finetuning reranker models on your data is very valuable. Consider for example these 2 models that I finetuned on 100k samples from the GooAQ dataset in 30 minutes and 1 hour, respectively. After finetuning, my models heavily outperformed general-purpose reranker models, even though GooAQ is a very generic dataset/domain!
Read my Training and Finetuning Reranker Models with Sentence Transformers v4 blogpost for many more details on these models and how they were trained.
Resources:
- How to use Cross Encoder models? [Cross Encoder > Usage](ht...
v3.4.1 - Model2Vec compatibility & offline model fix
This release introduces a convenient compatibility with Model2Vec models, and fixes a bug that caused an outgoing request even when using a local model.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.4.1
# Inference only, use one of:
pip install sentence-transformers==3.4.1
pip install sentence-transformers[onnx-gpu]==3.4.1
pip install sentence-transformers[onnx]==3.4.1
pip install sentence-transformers[openvino]==3.4.1
Full Model2Vec integration
This release introduces support to load an efficient Model2Vec embedding model directly in Sentence Transformers:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer(
"minishlab/potion-base-8M",
device="cpu",
)
# Run inference
sentences = [
'Gadofosveset-enhanced MR angiography of carotid arteries: does steady-state imaging improve accuracy of first-pass imaging?',
'To evaluate the diagnostic accuracy of gadofosveset-enhanced magnetic resonance (MR) angiography in the assessment of carotid artery stenosis, with digital subtraction angiography (DSA) as the reference standard, and to determine the value of reading first-pass, steady-state, and "combined" (first-pass plus steady-state) MR angiograms.',
'In a longitudinal study we investigated in vivo alterations of CVO during neuroinflammation, applying Gadofluorine M- (Gf) enhanced magnetic resonance imaging (MRI) in experimental autoimmune encephalomyelitis, an animal model of multiple sclerosis. SJL/J mice were monitored by Gadopentate dimeglumine- (Gd-DTPA) and Gf-enhanced MRI after adoptive transfer of proteolipid-protein-specific T cells. Mean Gf intensity ratios were calculated individually for different CVO and correlated to the clinical disease course. Subsequently, the tissue distribution of fluorescence-labeled Gf as well as the extent of cellular inflammation was assessed in corresponding histological slices.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.8085, 0.4884]])
Previously, loading a Model2Vec model required you to load a `StaticEmbedding` module.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
# Download from the 🤗 Hub
module = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[module], device="cpu")
# Run inference
sentences = [
'Gadofosveset-enhanced MR angiography of carotid arteries: does steady-state imaging improve accuracy of first-pass imaging?',
'To evaluate the diagnostic accuracy of gadofosveset-enhanced magnetic resonance (MR) angiography in the assessment of carotid artery stenosis, with digital subtraction angiography (DSA) as the reference standard, and to determine the value of reading first-pass, steady-state, and "combined" (first-pass plus steady-state) MR angiograms.',
'In a longitudinal study we investigated in vivo alterations of CVO during neuroinflammation, applying Gadofluorine M- (Gf) enhanced magnetic resonance imaging (MRI) in experimental autoimmune encephalomyelitis, an animal model of multiple sclerosis. SJL/J mice were monitored by Gadopentate dimeglumine- (Gd-DTPA) and Gf-enhanced MRI after adoptive transfer of proteolipid-protein-specific T cells. Mean Gf intensity ratios were calculated individually for different CVO and correlated to the clinical disease course. Subsequently, the tissue distribution of fluorescence-labeled Gf as well as the extent of cellular inflammation was assessed in corresponding histological slices.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.8085, 0.4884]])
Model2Vec was the inspiration for the recent Static Embedding work; all of these models can be used to approach the performance of normal transformer-based embedding models at a fraction of the latency. For example, both Model2Vec and Static Embedding models are ~25x faster than tiny embedding models on a GPU and ~400x faster than those models on a CPU.
Bug Fix
- Using local_files_only=True still triggered a request to Hugging Face for the model card metadata; this has been resolved in #3202.
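A minimal sketch of the now fully offline flow (assuming the model has already been downloaded or saved locally; the path is a placeholder):
from sentence_transformers import SentenceTransformer

# With the fix, this no longer makes any outgoing request for model card metadata
model = SentenceTransformer("path/to/locally-saved-model", local_files_only=True)
embeddings = model.encode(["This runs fully offline."])
print(embeddings.shape)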
All Changes
- fix loss name in documentation of CachedMultipleNegativesRankingLoss by @JINO-ROHIT in #3191
- Bump jinja2 from 3.1.4 to 3.1.5 in /docs by @dependabot in #3192
- minor typo in MegaBatchMarginLoss by @JINO-ROHIT in #3193
- Fix type hint of StaticEmbedding.__init__ by @altescy in #3196
- [integration] Work towards full model2vec integration by @tomaarsen in #3182
- Don't call set_base_model when local_files_only=True by @Davidyz in #3202
New Contributors
- @dependabot made their first contribution in #3192
- @altescy made their first contribution in #3196
- @Davidyz made their first contribution in #3202
Full Changelog: v3.4.0...v3.4.1
v3.4.0 - Resolved memory leak when deleting a model & trainer; add Matryoshka & Cached loss compatibility; small features & bug fixes
This release resolves a memory leak when deleting a model & trainer, adds compatibility between the Cached... losses and the Matryoshka loss modifier, resolves numerous bugs, and adds several small features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.4.0
# Inference only, use one of:
pip install sentence-transformers==3.4.0
pip install sentence-transformers[onnx-gpu]==3.4.0
pip install sentence-transformers[onnx]==3.4.0
pip install sentence-transformers[openvino]==3.4.0
Matryoshka & Cached loss compatibility (#3068, #3107)
It is now possible to combine the strong Cached losses (CachedMultipleNegativesRankingLoss, CachedGISTEmbedLoss, CachedMultipleNegativesSymmetricRankingLoss) with the Matryoshka loss modifier:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset
model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
"anchor": ["It's nice weather outside today.", "He drove to work."],
"positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])
trainer = SentenceTransformerTrainer(
model=model,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
See for example tomaarsen/mpnet-base-gooaq-cmnrl-mrl which was trained with CachedMultipleNegativesRankingLoss (CMNRL) with the Matryoshka loss modifier (MRL).
Resolve memory leak when Model and Trainer are reinitialized (#3144)
Due to a circular dependency in the SentenceTransformerTrainer -> SentenceTransformer -> SentenceTransformerModelCardData -> SentenceTransformerTrainer chain, deleting the trainer and model still didn't free them via garbage collection. I've moved a lot of components around, and now SentenceTransformerModelCardData does not need to store the SentenceTransformerTrainer, breaking the cycle.
We ran the seed optimization script (which frequently creates and deletes models and trainers):
- Before: Approximate highest recorded VRAM: 16332MiB / 24576MiB
- After: Approximate highest recorded VRAM: 8222MiB / 24576MiB
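A minimal sketch of the kind of loop that now releases VRAM between iterations (the dataset, loss, and hyperparameters are illustrative):
import gc

import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

for _ in range(3):
    model = SentenceTransformer("microsoft/mpnet-base")
    trainer = SentenceTransformerTrainer(
        model=model,
        train_dataset=train_dataset,
        loss=losses.MultipleNegativesRankingLoss(model),
    )
    trainer.train()
    # With the circular reference removed, deleting these now actually frees the memory
    del trainer, model
    gc.collect()
    torch.cuda.empty_cache()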
Small Features
- Add Matthews Correlation Coefficient to the BinaryClassificationEvaluator in #3051.
- Add a triplet margin parameter to the TripletEvaluator in #2862.
- Put dataset information in the automatically generated model card in "expanding sections" blocks if there are many datasets in #3088.
- Add multi-GPU (and CPU multi-process) support for mine_hard_negatives in #2967.
Notable Bug Fixes
- Subsequent batches were identical when using the no_duplicates Batch Sampler (#3069). This has been resolved in #3073.
- The old-style model.fit() training with write_csv on an evaluator would crash (#3062). This has been resolved in #3066.
- The output types of some evaluators were np.float instead of float (#3075). This has been resolved in #3076 and #3096.
- It was not possible to specify a revision or cache_dir when loading a PEFT Adapter model (#3061). This has been resolved in #3079 and #3174.
- The CrossEncoder was lazily placed on the incorrect device and did not respond to model.to (#3078). This has been resolved in #3104.
- If a model used a custom module with custom kwargs, those kwargs keys were not saved in modules.json correctly, e.g. relevant for jina-embeddings-v3 (#3111). This has been resolved in #3112.
- HfArgumentParser(SentenceTransformerTrainingArguments) would crash due to prompts typing (#3090). This has been resolved in #3178.
Example Updates
- Update the quantization script in #3070.
- Update the seed optimization script in #3092.
- Update the TSDAE scripts in #3137.
- Add PEFT Adapter script in #3180.
Documentation Updates
- Add PEFT Adapter documentation in #3180.
- Add links to backend-export in Speeding up Inference.
All Changes
- [training] Pass steps/epoch/output_path to Evaluator during training by @tomaarsen in #3066
- [examples] Update the quantization script by @tomaarsen in #3070
- [fix] Fix different batches per epoch in NoDuplicatesBatchSampler by @tomaarsen in #3073
- [docs] Add links to backend-export in Speeding up Inference by @tomaarsen in #3071
- add MCC to BinaryClassificationEvaluator by @JINO-ROHIT in #3051
- support cached losses in combination with matryoshka loss by @Marcel256 in #3068
- align model_card_templates.py with code by @amitport in #3081
- converting np float result to float in binary classification evaluator by @JINO-ROHIT in #3076
- Add triplet margin for distance functions in TripletEvaluator by @zivicmilos in #2862
- [model_card] Keep the model card readable even with many datasets by @tomaarsen in #3088
- [docs] Add NanoBEIR to the Training Overview evaluators by @tomaarsen in #3089
- [fix] revision of the adapter model can now be specified. by @pesuchin in #3079
- [docs] Update from Sphinx==3.5.4 to 8.1.3, recommonmark -> myst-parser by @tomaarsen in #3099
- normalize to float in NanoBEIREvaluator, InformationRetrievalEvaluator, MSEEvaluator by @JINO-ROHIT in #3096
- [docs] List 'prompts' as a key training argument by @tomaarsen in #3101
- revert float type cast manually in BinaryClassificationEvaluator by @JINO-ROHIT in #3102
- update train_sts_seed_optimization with SentenceTransformerTrainer by @JINO-ROHIT in #3092
- Fix cross encoder device issue by @susnato in #3104
- [enhancement] Make MultipleNegativesRankingLoss easier to understand by @tomaarsen in #3100
- [fix] Fix breaking change in PyLate when loading modules by @tomaarsen in #3110
- multi-GPU support for mine_hard_negatives by @alperctnkaya in #2967
- raises error when dataset is an empty list in NanoBEIREvaluator by @JINO-ROHIT in #3122
- Added a note to the documentation stating that the similarity method does not support embeddings other than non-quantized ones. by @pesuchin in #3131
- [typo] Add missing space between sentences in error message by @tomaarsen in #3125
- raises ValueError when num_label !=1 when using Crossencoder.rank() by @JINO-ROHIT in #3126
- fix backward pass for cached losses by @Marcel256 in #3114
- Adding evaluation checks to prevent Transformer ValueError by @stsfaroz in #3105
- [typo] Fix incorrect spelling for "corpus" by @ignasgr in #3154
- [fix] Save custom module kwargs if specified by @tomaarsen in #3112
- [memory] Avoid storing trainer in ModelCardCallback and SentenceTransformerModelCardData by @tomaarsen in #3144
- Suport for embedded representation by @Radu1999 in #3156
- [DRAFT] tests for nanobeir evaluator by @JINO-ROHIT in #3127
- Update TSDAE examples with SentenceTransformerTrainer by @JINO-ROHIT in #3137
- [docs] Update the Static Embedding example snippet by @tomaarsen in #3177
- fix: propagate cache dir to find adapter by @lauralehoczki11 in #3174
- [fix] Use HfArgumentParser-compatible typing for prompts by @tomaarsen in #3178
- testcases for community detection by @JINO-ROHIT in #3163
...
v3.3.1 - Patch private model loading without environment variable
This patch release fixes a small issue with loading private models from Hugging Face using the token argument.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.3.1
# Inference only, use one of:
pip install sentence-transformers==3.3.1
pip install sentence-transformers[onnx-gpu]==3.3.1
pip install sentence-transformers[onnx]==3.3.1
pip install sentence-transformers[openvino]==3.3.1
Details
If you're loading a model under this scenario:
- Your model is hosted on Hugging Face.
- Your model is private.
- You haven't set the HF_TOKEN environment variable via huggingface-cli login or some other approach.
- You're passing the token argument to SentenceTransformer to load the model.
Then you may have encountered a crash in v3.3.0. This should be resolved now.
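A minimal sketch of that scenario (the repository name and token value are placeholders):
from sentence_transformers import SentenceTransformer

# Loading a private model by passing the token directly, without HF_TOKEN being set
model = SentenceTransformer("your-username/your-private-model", token="hf_...")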
All Changes
- [docs] Fix the prompt link to the training script by @tomaarsen in #3060
- [Fix] Resolve loading private Transformer model in version 3.3.0 by @pesuchin in #3058
Full Changelog: v3.3.0...v3.3.1
v3.3.0 - Massive CPU speedup with OpenVINO int8 quantization; Training with Prompts for stronger models; NanoBEIR IR evaluation; PEFT compatibility; Transformers v4.46.0 compatibility
4x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free performance boost, convenient evaluation on NanoBEIR: a subset of a strong Information Retrieval benchmark, PEFT compatibility by easily adding/loading adapters, Transformers v4.46.0 compatibility, and Python 3.8 deprecation.
Install this version with:
# Training + Inference
pip install sentence-transformers[train]==3.3.0
# Inference only, use one of:
pip install sentence-transformers==3.3.0
pip install sentence-transformers[onnx-gpu]==3.3.0
pip install sentence-transformers[onnx]==3.3.0
pip install sentence-transformers[openvino]==3.3.0
OpenVINO int8 static quantization (#3025)
We introduce int8 static quantization using OpenVINO, a highly performant solution that outperforms all other current backends by a mile, at a minimal loss in performance. Here are the updated benchmarks:
Quantizing directly to the Hugging Face Hub
from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
# 1. Load a model with the OpenVINO backend
model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")
# 2. Quantize the model to int8, push the model to https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# as a pull request:
export_static_quantized_openvino_model(
model,
quantization_config=None,
model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
You can immediately use the model, even before it's merged, by using the revision argument:
from sentence_transformers import SentenceTransformer
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = SentenceTransformer(
"all-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino_model_qint8_quantized.xml"},
revision=f"refs/pr/{pull_request_nr}"
)
And once it's merged:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"all-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
Quantizing locally
You can also quantize a model and save it locally:
from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
from optimum.intel import OVQuantizationConfig
model = SentenceTransformer("all-mpnet-base-v2", backend="openvino")
model.save_pretrained("path/to/all-mpnet-base-v2-local")
quantization_config = OVQuantizationConfig() # <- You can update settings here
export_static_quantized_openvino_model(model, quantization_config, "path/to/all-mpnet-base-v2-local")
And after quantizing, you can load it like so:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"path/to/all-mpnet-base-v2-local",
backend="openvino",
model_kwargs={"file_name": "openvino_model_qint8_quantized.xml"},
)
All original Sentence Transformer models already have these new openvino_model_qint8_quantized.xml files, so you can load them without exporting directly! I would recommend making pull requests for other models on Hugging Face that you'd like to see quantized.
Learn more about how to Speed up Inference in the documentation: https://sbert.net/docs/sentence_transformer/usage/efficiency.html
Training with Prompts (#2964)
Many modern embedding models are trained with "instructions" or "prompts" following the INSTRUCTOR paper. These prompts are strings, prefixed to each text to be embedded, allowing the model to distinguish between different types of text.
For example, the mixedbread-ai/mxbai-embed-large-v1 model was trained with "Represent this sentence for searching relevant passages: " as the prompt for all queries. This prompt is stored in the model configuration under the prompt name "query", so users can specify that prompt_name in model.encode:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedding = model.encode("What are Pandas?", prompt_name="query")
# or
# query_embedding = model.encode("What are Pandas?", prompt="Represent this sentence for searching relevant passages: ")
document_embeddings = model.encode([
"Pandas is a software library written for the Python programming language for data manipulation and analysis.",
"Pandas are a species of bear native to South Central China. They are also known as the giant panda or simply panda.",
"Koala bears are not actually bears, they are marsupials native to Australia.",
])
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# => tensor([[0.7594, 0.7560, 0.4674]])
Various papers (INSTRUCTOR, BGE) show that including prompts or instructions both during training and inference results in stronger performance. As of this release, it's now possible to easily train with prompts in Sentence Transformers with just one extra training argument: prompts. There are 4 accepted formats for it:
1. str: A single prompt to use for all columns in all datasets. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts="text: ",
    ...,
)
2. Dict[str, str]: A dictionary mapping column names to prompts, applied to all datasets. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "query": "query: ",
        "answer": "document: ",
    },
    ...,
)
3. Dict[str, str]: A dictionary mapping dataset names to prompts. This should only be used if your training/evaluation/test datasets are a DatasetDict or a dictionary of Dataset. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "stsb": "Represent this text for semantic similarity search: ",
        "nq": "Represent this text for retrieval: ",
    },
    ...,
)
4. Dict[str, Dict[str, str]]: A dictionary mapping dataset names to dictionaries mapping column names to prompts. This should only be used if your training/evaluation/test datasets are a DatasetDict or a dictionary of Dataset. For example:
args = SentenceTransformerTrainingArguments(
    ...,
    prompts={
        "stsb": {
            "sentence1": "sts: ",
            "sentence2": "sts: ",
        },
        "nq": {
            "query": "query: ",
            "document": "document: ",
        },
    },
    ...,
)
I've trained models with and without prompts for 2 base models: mpnet-base and bert-base-uncased:
- tomaarsen/mpnet-base-nq
- tomaarsen/mpnet-base-nq-prompts
- tomaarsen/bert-base-nq
- tomaarsen/bert-base-nq-prompts
For both base models, the model with prompts consistently outperformed the baseline model. After training, the models with prompts resulted in a 0.66% and 0.90% relative improvement on NDCG@10 at no extra cost.
(Result figures: mpnet-base tests and bert-base-uncased tests)
- Training with Prompts documentation: https://sbert.net/examples/training/prompts/README.html
- Training with Prompts training script: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/prompts/training_nq_prompts.py
NanoBEIR Evaluator integration (#2966)
This update introduced a new simple NanoBEIREvaluator, evaluating your model against NanoBEIR: a collection of subsets of the 13 BEIR datasets. BEIR corresponds to the retrieval tab of MTEB, and is commonly seen as a valuable indicator of general-purpose information retrieval performance.
With the NanoBEIREvaluator, you can easily evaluate your models on a much faster benchmark that should give similar insights in performance...
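A minimal usage sketch (assuming the default configuration, which evaluates on every NanoBEIR dataset):
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = NanoBEIREvaluator()  # defaults to the full set of NanoBEIR datasets
results = evaluator(model)
print(evaluator.primary_metric)
print(results[evaluator.primary_metric])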
v3.2.1 - Patch CLIP loading, small ONNX fix, compatibility with other libraries
This patch release fixes some small bugs, such as related to loading CLIP models, automatic model card generation issues, and ensuring compatibility with third party libraries.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==3.2.1
# Inference only, use one of:
pip install sentence-transformers==3.2.1
pip install sentence-transformers[onnx-gpu]==3.2.1
pip install sentence-transformers[onnx]==3.2.1
pip install sentence-transformers[openvino]==3.2.1
Fix loading non-Transformer models
In v3.2.0, a non-Transformer based model (e.g. CLIP) would not load correctly if the model was saved in the root of the model repository/directory. This has been resolved in #3007.
Throw error if StaticEmbedding-based model is finetuned with incompatible losses
The following losses are not compatible with StaticEmbedding-based models:
- CachedGISTEmbedLoss
- CachedMultipleNegativesRankingLoss
- CachedMultipleNegativesSymmetricRankingLoss
- DenoisingAutoEncoderLoss
- GISTEmbedLoss
An error is now thrown when one of these is used with a StaticEmbedding-based model. I recommend using MultipleNegativesRankingLoss to finetune these models, e.g. as in https://huggingface.co/tomaarsen/static-bert-uncased-gooaq.
Note: to get good performance, you must use much higher learning rates than otherwise. In my experiments, 2e-1 worked well.
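For example, a sketch of training arguments reflecting that note (values other than the learning rate are illustrative):
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/static-embedding-gooaq",  # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=2048,  # static models are cheap, so large batches are feasible
    learning_rate=2e-1,  # much higher than the ~2e-5 typically used for transformer models
)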
Patch ONNX model when the model uses output_hidden_states
For example, this script used to fail, but passes now:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"distiluse-base-multilingual-cased",
backend="onnx",
model_kwargs={"provider": "CPUExecutionProvider"},
)
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)
All changes
- Bump optimum version by @echarlaix in #2984
- [docs] Update the training snippets for some losses that should use the v3 Trainer by @tomaarsen in #2987
- [enh] Throw error if StaticEmbedding-based model is trained with incompatible loss by @tomaarsen in #2990
- [fix] Fix semantic_search_usearch with 'binary' by @tomaarsen in #2989
- [enh] Add support for large_string in model card create by @yaohwang in #2999
- [model cards] Prevent crash on generating widgets if dataset column is empty by @tomaarsen in #2997
- [fix] Added model2vec import compatible with current and newer version by @Pringled in #2992
- Fix cache_dir issue with loading CLIPModel by @BoPeng in #3007
- [warn] Throw a warning if compute_metrics is set, as it's not used by @tomaarsen in #3002
- [fix] Prevent IndexError if output_hidden_states & ONNX by @tomaarsen in #3008
New Contributors
- @echarlaix made their first contribution in #2984
- @yaohwang made their first contribution in #2999
- @Pringled made their first contribution in #2992
- @BoPeng made their first contribution in #3007
Full Changelog: v3.2.0...v3.2.1