[`v4`] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

tomaarsen · 2025-02-07T14:41:10Z

Hello!

Pull Request overview

Introduce a new training loop, using a CrossEncoderTrainer, CrossEncoderTrainingArguments, and loss functions. Brings features such as:
- Multi-GPU training (DP, DDP), multi-dataset training, bf16, loss logging (Weights & Biases, Tensorboard, the terminal, etc.), many additional hyperparameters, automatic model card generation, etc.
A new MultipleNegativesRankingLoss and CachedMultipleNegativesRankingLoss (a.k.a. InfoNCE and InfoNCE with GradCache) to train with anchor-positive pairs, anchor-positive-negative triplets, and anchor-positive-negative1-...-negativeN tuples.

TODOs

Details

Overall, the goal of this refactor is to introduce feature parity between the Cross Encoder training and the Sentence Transformer training. Luckily, the work done for the ST trainer can be extended rather easily, so the refactor is not as big as it was for the SentenceTransformer class in v3.0.

Notably, training now centers around:

a training Dataset or DatasetDict. This class is much more suited for sharing & efficient modifications than lists/DataLoaders of InputExample instances. A Dataset can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns. The names of these columns are irrelevant at this time. If there is a "label" column, it is treated separately, and used as the labels during training.
A DatasetDict can be used to train with multiple datasets at once, e.g.:
```
DatasetDict({
    multi_nli: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 392702
    })
    snli: Dataset({
        features: ['snli_premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
    stsb: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5749
    })
})
```
When a DatasetDict is used, the loss parameter to the CrossEncoderTrainer must also be a dictionary with these dataset keys, e.g.:
```
{
    'multi_nli': BinaryCrossEntropyLoss(...),
    'snli': BinaryCrossEntropyLoss(...),
    'stsb': CrossEntropyLoss(...),
}
```
By default, these are sampled from in proportion to their sizes.
A loss function, or a dictionary of loss functions like described above. We now support highly customizable losses, much more than before. The loss can now e.g. make pairs dynamically, as the loss gets "raw" texts, not pretokenized inputs.
A CrossEncoderTrainingArguments instance, subclass of a SentenceTransformerTrainingArguments instance. This powerful class controls the specific details of the training.
An optional SentenceEvaluator instance. These instances either return a float, or a dictionary with metric keys and values. If the latter, the class must also defined evaluator.primary_metric so e.g. the "best model" checkpointing can be based on an evaluator score.
Models can now be evaluated both on an evaluation dataset with some loss function and/or a SentenceEvaluator instance.
The new CrossEncoderTrainer instance. This instance is provided with a CrossEncoder model, a CrossEncoderTrainingArguments class, a SentenceEvaluator, a training and evaluation Dataset/DatasetDict and a loss function/dict of loss functions. Most of these parameters are optional. Once provided, all you have to do is call train().

This is an example of an extensive training script with all of the features at play:

import logging
from datetime import datetime

from datasets import load_dataset

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEClassificationEvaluator
from sentence_transformers.cross_encoder.losses.CrossEntropyLoss import CrossEntropyLoss
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer
from sentence_transformers.cross_encoder.training_args import CrossEncoderTrainingArguments

# Set the log level to INFO to get more information
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

train_batch_size = 64
num_epochs = 1
output_dir = "output/training_ce_allnli-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# 1. Define our CrossEncoder model. We use distilroberta-base as basis and setup it up to predict 3 labels
# You can also use other base models, like bert-base-uncased, microsoft/mpnet-base, etc.
model = CrossEncoder("distilroberta-base", num_labels=3)

# 2. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli
# We'll start with 10k training samples, but you can increase this to get a stronger model
logging.info("Read AllNLI train dataset")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train").select(range(10000))
eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev").select(range(1000))
test_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="test")
logging.info(train_dataset)

# We might want to remap labels from the dataset, you can do that like so:
mapping = {0: 1, 1: 2, 2: 0}
eval_dataset = eval_dataset.map(lambda x: {"label": mapping[x["label"]]})
test_dataset = test_dataset.map(lambda x: {"label": mapping[x["label"]]})

# 3. Define our training loss:
loss = CrossEntropyLoss(model)

# During training, we use CEClassificationEvaluator to measure the performance on the dev set
dev_cls_evaluator = CEClassificationEvaluator(
    list(zip(eval_dataset["premise"], eval_dataset["hypothesis"])),
    eval_dataset["label"],
    name="AllNLI-dev",
)
dev_cls_evaluator(model)

# 5. Define the training arguments
args = CrossEncoderTrainingArguments(
    # Required parameter:
    output_dir=output_dir,
    # Optional training parameters:
    num_train_epochs=num_epochs,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=train_batch_size,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    run_name="ce-nli-v1",  # Will be used in W&B if `wandb` is installed
)

# 6. Create the trainer & start training
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_cls_evaluator,
)
trainer.train()

# 7. Evaluate the final model on test dataset
test_cls_evaluator = CEClassificationEvaluator(
    list(zip(test_dataset["premise"], test_dataset["hypothesis"])),
    test_dataset["label"],
    name="AllNLI-test",
)
test_cls_evaluator(model)

# 8. Evaluate the final model and save it
final_output_dir = f"{output_dir}/final"
model.save_pretrained(final_output_dir)

As you may note, it is very similar to the new SentenceTransformer flow: datasets Dataset, standalone loss with a lot more flexibility than before, a TrainingArguments and Trainer class, Evaluators much like before & as used in SentenceTransformer training, etc.

cc @milistu as you're also working on CrossEncoders
cc @LysandreJik

Tom Aarsen

into feat/cross_encoder_trainer

…n Trainer

milistu · 2025-02-15T17:22:25Z

Hi @tomaarsen

I've used your fork and branch as a base and added my implementation for ListNet Loss.

Changes

Updated CrossEncoderDataCollator
- Added new values in: valid_label_columns.
- Decorated CrossEncoderDataCollator with @dataclass to ensure changes take effect correctly. Previously, it was using variables from SentenceTransformerDataCollator even after modification. Now everything works as expected.
Implemented ListNetLoss
- Added ListNetLoss implementation.
- Updated __init__.py to include the new loss function.
Added an example script
- Demonstrates how to prepare a dataset and use ListNetLoss with the MS MARCO dataset.

🔗 Link to My Branch

feat/cross_encoder_trainer

Let me know what you think or if any modifications are needed!

tomaarsen · 2025-02-16T07:22:39Z

Hello!

This is excellent work, looks very solid! Have you been able to run the training script yourself so far? I can also try and run it and upload the finished model.
Edit: I trained one: https://huggingface.co/tomaarsen/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-listnet

In the coming days I can try and merge your work into this PR.

Tom Aarsen

tomaarsen · 2025-02-17T08:44:56Z

It's interesting to see that although the model does get better than the BM25 baseline, the loss effectively does not change.

milistu · 2025-02-17T12:42:23Z

Hi @tomaarsen 👋

I successfully trained the model and experimented with hyperparameters.

Trained Model

You can find the trained model here:
Studeni/reranker-msmarco-v1.1-ModernBERT-base-listnet

Observations on Loss

I noticed that the loss is slightly higher (~2.0). Through research and testing, I found that this discrepancy arises due to differences in distribution:

Ground truth labels are binary, which results in a sharp distribution.
Our CrossEncoder outputs discrete values, leading to a different distribution compared to the ground truth.
This mismatch explains the higher loss value.

One possible solution is to apply a transformation to the predicted distribution to better align it with the ground truth. However, for now, I think this is sufficient. Instead of refining this approach further, I’d prefer to integrate more listwise loss functions that are known to outperform ListNet.

Additionally, I assume that combining MSE and ListNet loss could yield better results by leveraging the strengths of both approaches. I can explore this further.

Issue with Missing Values in Evaluation CSV

While training, I noticed missing values in the evaluation CSVs.

Example from CERerankingEvaluator_NanoMSMARCO_results_@10.csv:

epoch,steps,MAP,MRR@10,NDCG@10
0.17214666896195557,2000,0.04166666666666666,0.0654804137172179
0.34429333792391115,4000,0.2629365079365079,0.32231584858096596
0.5164400068858668,6000,0.417079365079365,0.4742922970864457
0.6885866758478223,8000,0.45016666666666666,0.5006481726476146
0.8607333448097779,10000,0.49913492063492065,0.5646165118665413
1.0328800137717336,12000,0.49426984126984125,0.5609090407025468
1.205026682733689,14000,0.4636349206349206,0.5413296770100868
1.3771733516956446,16000,0.4563809523809524,0.5233662617261322

Here, we expect five columns, but only four values appear in some rows.
Initially, I suspected that NDCG was missing, but after further inspection, I believe MAP is missing. This issue occurs in all evaluation CSVs.

Has this happened in your training as well? I used the same (or a very similar) setup from your MS MARCO training example.

tomaarsen · 2025-02-17T12:59:13Z

I'll investigate the CSV issue, that one is definitely on me.

Beyond that, I've implemented an activation_fct parameter to all losses that post-process the logits to try and help with the distributions. For example, with MultipleNegativesRankingLoss (a.k.a. InfoNCE loss) I use Tanh based on https://arxiv.org/abs/2407.19669 to map the predictions as scores between -1 and 1 before using them in Cross Entropy Loss. Perhaps something like that is required for ListNetLoss as well?

I'm definitely open to other Listwise loss implementations! I'm currently looking into improving mine_hard_negatives so that it can be used to generate training and evaluation data with larger datasets.

Tom Aarsen

yjoonjang · 2025-03-17T06:12:00Z

Hi @milistu, @tomaarsen
I don't know this is a right way to contribute, but I've implemetned ListMLELoss to @tomaarsen's cross_encoder_trainer branch.

About ListMLELoss

ListMLE is a listwise learning-to-rank loss function that directly optimizes the likelihood of the correct permutation of documents. It models the probability of a permutation using the Plackett-Luce model, which sequentially selects items based on their scores.
The key difference between ListMLE and ListNet is in how they model the ranking problem:

ListNet uses a softmax to convert scores to probability distributions and minimizes the cross-entropy between the predicted distribution and the ground truth distribution. It focuses on the probability of items being ranked at the top position.
ListMLE directly models the probability of the entire permutation and maximizes the likelihood of the correct ordering. It captures the sequential nature of the ranking process, where each position is filled one by one.

Additionally, I've implemented Position-Aware ListMLE with lambda weighting, which applies different weights to different rank positions. This allows the model to focus more on getting the top positions correct, which is often more important in ranking tasks.

This loss is particularly valuable when dealing with multiple relevant documents that have a clear preference order. For example, when training a reranker for tool selection, some tools should be ranked higher than others for a given query, even though both are relevant. ListMLE can effectively learn these nuanced preferences by modeling the entire permutation probability, ensuring that the most suitable tool appears first in the ranking, followed by the second-best option, and so on.

PR

tomaarsen#6

Youngjoon Jang

* Add ListMLELoss * Fix input_order not being considered * Update init.py * Add training scripts for ListMLELoss * Fix self.lambda_weight to ListMLELambdaWeight Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> * Refactor conditional logic in ListMLELambdaWeight Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> * Fix Delete unused function - create_p_list_mle_lambda_weight * Refactor mask creation using zeros_like Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> * Refactor for-loop with vectorized operations when applying position weights Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> * Fix reference and citation * Refactor to seperate PListMLE and ListMLE * Refactor training scripts for PListMLE and ListMLE * Add information of data to be sorted in a defined rank order * Run formatting * Ensure that paddings are excluded in the loss * Remove lambda_weight as an option from ListMLELoss * Update documentation throughout * Update MS MARCO training examples docs * Fix anecdotal ranking * Vectorize the lambda weight computation It should be equivalent, and considerably faster (although it wasn't necessarily a bottleneck) * Add +1 to PListMLELambdaWeight; normalize weight by divide-by-sum * Simplify code, remove duplicate +1 * Update get_config_dict for ListMLELoss to remove lambda_weight --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>

yjoonjang · 2025-03-19T17:16:42Z

Hi @tomaarsen, @milistu

I was wondering if you would also like to implement RankNetLoss (a.k.a pairwise logistic loss)
This loss represents the 'pairwise loss', and is also used in RankLLM (https://github.com/castorini/rank_llm)

I've actually worked on the code and did some experiments about activation functions.

code

from __future__ import annotations

from typing import Literal

import torch
from torch import Tensor, nn

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.util import fullname


class RankNetLoss(nn.Module):
    def __init__(
        self,
        model: CrossEncoder,
        sigma: float = 1.0,
        eps: float = 1e-10,
        activation_fct: nn.Module | None = nn.Identity(),
        mini_batch_size: int | None = None,
    ) -> None:
        """
        RankNet loss implementation for learning to rank. This loss function implements the RankNet algorithm,
        which learns a ranking function by optimizing pairwise document comparisons using a neural network.
        The implementation is optimized to handle padded documents efficiently by only processing valid
        documents during model inference.

        Args:
            model (CrossEncoder): CrossEncoder model to be trained
            sigma (float): Score difference weight used in sigmoid (default: 1.0)
            eps (float): Small constant for numerical stability (default: 1e-10)
            activation_fct (:class:`~torch.nn.Module`): Activation function applied to the logits before computing the
                loss. Defaults to :class:`~torch.nn.Identity`.
            mini_batch_size (int, optional): Number of samples to process in each forward pass. This has a significant
                impact on the memory consumption and speed of the training process. Three cases are possible:
                - If ``mini_batch_size`` is None, the ``mini_batch_size`` is set to the batch size.
                - If ``mini_batch_size`` is greater than 0, the batch is split into mini-batches of size ``mini_batch_size``.
                - If ``mini_batch_size`` is <= 0, the entire batch is processed at once.
                Defaults to None.

        References:
            - Learning to Rank using Gradient Descent: https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf

        Requirements:
            1. Query with multiple documents (pairwise approach)
            2. Documents must have relevance scores/labels. Both binary and continuous labels are supported.

        Inputs:
            +----------------------------------------+--------------------------------+-------------------------------+
            | Texts                                  | Labels                         | Number of Model Output Labels |
            +========================================+================================+===============================+
            | (query, [doc1, doc2, ..., docN])       | [score1, score2, ..., scoreN]  | 1                             |
            +----------------------------------------+--------------------------------+-------------------------------+

        Example:
            ::

                from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
                from datasets import Dataset

                model = CrossEncoder("microsoft/mpnet-base")
                train_dataset = Dataset.from_dict({
                    "query": ["What are pandas?", "What is the capital of France?"],
                    "docs": [
                        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
                        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
                    ],
                    "labels": [[1, 0], [1, 1, 0]],
                })
                loss = losses.RankNetLoss(model)

                trainer = CrossEncoderTrainer(
                    model=model,
                    train_dataset=train_dataset,
                    loss=loss,
                )
                trainer.train()
        """
        super().__init__()
        self.model = model
        self.sigma = sigma
        self.eps = eps
        self.activation_fct = activation_fct or nn.Identity()
        self.mini_batch_size = mini_batch_size

        if self.model.num_labels != 1:
            raise ValueError(
                f"{self.__class__.__name__} supports a model with 1 output label, "
                f"but got a model with {self.model.num_labels} output labels."
            )

    def forward(self, inputs: list[list[str], list[list[str]]], labels: list[Tensor]) -> Tensor:
        """
        Compute RankNet loss for a batch of queries and their documents.

        Args:
            inputs: List of (queries, documents_list)
            labels: Ground truth relevance scores, shape (batch_size, num_documents)

        Returns:
            Tensor: RankNet loss over the batch
        """
        if isinstance(labels, Tensor):
            raise ValueError(
                "RankNetLoss expects a list of labels for each sample, but got a single value for each sample."
            )
        if len(inputs) != 2:
            raise ValueError(f"RankNetLoss expects two inputs (queries, documents_list), but got {len(inputs)} inputs.")

        queries, docs_list = inputs
        docs_per_query = [len(docs) for docs in docs_list]
        max_docs = max(docs_per_query)
        batch_size = len(queries)

        if docs_per_query != [len(labels) for labels in labels]:
            raise ValueError(
                f"Number of documents per query in inputs ({docs_per_query}) does not match number of labels per query ({[len(labels) for labels in labels]})."
            )

        # Create input pairs for the model₩
        pairs = [(query, document) for query, docs in zip(queries, docs_list) for document in docs]

        if not pairs:
            # Handle edge case where all documents are padded
            return torch.tensor(0.0, device=self.model.device, requires_grad=True)

        mini_batch_size = self.mini_batch_size or batch_size
        if mini_batch_size <= 0:
            mini_batch_size = len(pairs)

        logits_list = []
        for i in range(0, len(pairs), mini_batch_size):
            mini_batch_pairs = pairs[i : i + mini_batch_size]

            tokens = self.model.tokenizer(
                mini_batch_pairs,
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            tokens = tokens.to(self.model.device)

            logits = self.model(**tokens)[0].view(-1)
            logits_list.append(logits)

        logits = torch.cat(logits_list, dim=0)
        logits = self.activation_fct(logits)

        # Create output tensor filled with 0 (padded logits will be ignored via labels)
        logits_matrix = torch.full((batch_size, max_docs), -1e16, device=self.model.device)

        # Place logits in the desired positions in the logit matrix
        doc_indices = torch.cat([torch.arange(len(docs)) for docs in docs_list], dim=0)
        batch_indices = torch.repeat_interleave(torch.arange(batch_size), torch.tensor(docs_per_query))
        logits_matrix[batch_indices, doc_indices] = logits

        # Create labels matrix
        labels_matrix = torch.full_like(logits_matrix, float("-inf"))
        labels_matrix[batch_indices, doc_indices] = torch.cat(labels, dim=0).float()
        labels_matrix = labels_matrix.to(self.model.device)

        # Calculate pairwise differences for scores and labels
        score_diffs = logits_matrix[:, :, None] - logits_matrix[:, None, :]
        label_diffs = labels_matrix[:, :, None] - labels_matrix[:, None, :]

        # Create mask for valid pairs (where both documents are not padded)
        valid_pairs = torch.isfinite(label_diffs)
        
        # Create mask for pairs where l_i > l_j
        positive_pairs = label_diffs > 0

        # Calculate probabilities and target probabilities
        P_ij = torch.sigmoid(self.sigma * score_diffs)
        P_ij = torch.clamp(P_ij, min=self.eps, max=1-self.eps)
        
        # Calculate loss only for pairs where l_i > l_j (positive_pairs)
        losses = -torch.log(P_ij)

        # Apply masks and compute mean loss
        masked_loss = losses[valid_pairs & positive_pairs]

        # Handle case when there are no positive pairs
        if masked_loss.numel() == 0:
            return torch.tensor(0.0, device=self.model.device, requires_grad=True)

        loss = torch.mean(masked_loss)

        return loss

    def get_config_dict(self) -> dict[str, float | int | str | None]:
        """
        Get configuration parameters for this loss function.

        Returns:
            Dictionary containing the configuration parameters
        """
        return {
            "sigma": self.sigma,
            "eps": self.eps,
            "activation_fct": fullname(self.activation_fct),
            "mini_batch_size": self.mini_batch_size,
        }

    @property
    def citation(self) -> str:
        return """
@inproceedings{burges2005learning,
  title={Learning to rank using gradient descent},
  author={Burges, Chris and Shaked, Tal and Renshaw, Erin and Lazier, Ari and Deeds, Matt and Hamilton, Nicole and Hullender, Greg},
  booktitle={Proceedings of the 22nd international conference on Machine learning},
  pages={89--96},
  year={2005}
}
"""

The result:

To simplify, the best recipe for training the reranker model was just using the Identity function.

model: yjoonjang/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-ranknetloss
result:

Metric	Value
map	0.4775 (+0.0874)
mrr@10	0.5597 (+0.0917)
ndcg@10	0.5298 (+0.0745)

However, looking at LambdaLoss, I found out that RankNetLoss is same with LambdaLoss when using NoWeightingScheme.

So I trained the model with

loss = LambdaLoss(
        model=model,
        weighting_scheme=NoWeightingScheme(),
        mini_batch_size=mini_batch_size,
    )

And the result was:

model: yjoonjang/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-lambdaloss-noweight
result:

Metric	Value
map	0.4633 (+0.0733)
mrr@10	0.5471 (+0.0791)
ndcg@10	0.5183 (+0.0629)

The two results above should be same theoretically, but it differs. I think it is firstly due to random initialization of classifier head (as Tom mentioned: Add Position-Aware ListMLELoss tomaarsen/sentence-transformers#6 (comment)
Also, it is due to whether I'm logging the weighted_probas or not. (LambdaLoss has losses = torch.log2(weighted_probas))

My suggestion

So if you would like to implement RankNetLoss, there could be two scenarios we can take.

Implementing the code above. (Or delete logging for the weighted_probas when using RankNetLoss)
Just initializing LambdaLoss with NoWeightingScheme. This would be like working on ListMLE classing PListMLE which @tomaarsen and I talked before (Add Position-Aware ListMLELoss tomaarsen/sentence-transformers#6 (comment)).

What are your thoughts?

tomaarsen · 2025-03-19T18:50:55Z

I'm definitely interested - I think it might make most sense to implement it as a subclass of the LambdaLoss!
As for our msmarco training scripts - I think we should try and add some seeding to the initialization, so we can move easily use those scripts to compare training performance (albeit just with a sample size of 1).

* Add RankNetLoss and training script * Fix ListMLELoss documentation * Fix RankNet to class LambdaLoss * Update training script for RankNetLoss * Use super().get_config_dict() and remove weighting scheme It's a bit confusing to include the weighting scheme in the config if the RankNet loss doesn't have a notion of that * Add to __init__.py for easier import * Correctly capitalize citation titles * Introduce reproducibility for the msmarco scripts * Add more docs for RankNetLoss * Add RankNet to Loss Overview & API Reference * Expand on RankNet docs slightly --------- Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>

into feat/cross_encoder_trainer

…viduals/164.pdf and https://arxiv.org/abs/2005.10084

Also fix version comparison for ST - I can't believe it was doing greater-than on strings for so long

tomaarsen · 2025-03-25T13:30:08Z

In conclusion:

Cross Encoder (i.e. reranker) refactor:
- New Trainer, Training Arguments, Data Collator
- 11 new losses
- 1 new, 3 refactored, 6 deprecated evaluators
- Model card generation
- 100% backwards compatibility with old training (model.fit)
Tests:
- 83 tests for CE Loading, inference, training, etc.
Docs:
- All new Training Overview, Loss Overview, API Reference docs
- 5 new, 1 refactored training examples docs pages
- 13 new, 6 refactored training scripts
- Migration guide (2.x -> 3.x, 3.x -> 4.x)

Blogpost coming on release day.

Big thanks to @milistu and @yjoonjang for their huge roles in the learning-to-rank losses.

Tom Aarsen

tomaarsen and others added 25 commits February 7, 2025 13:07

CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc.

398ab3d

Remove some dead code

c0ca4b9

Merge branch 'master' of https://github.com/UKPLab/sentence-transformers

38b7006

into feat/cross_encoder_trainer

Update incorrect docstring in ce training_nli

8bd570a

Remove newline

ad18659

Import as class instead of module

314f640

Fix CE inference tests as evaluators now output dicts

e97584a

Separate fit into mixin, rename fit -> old_fit, introduce fit based o…

8177924

…n Trainer

Update more CE training scripts

54b80c3

Fix push_to_hub - it didn't create a model card correctly

edb8f43

Fix link to loss docs

e1c2e49

Wrap NanoBEIR dataset loading in tqdm

fa88950

Add number of labels to the model card; avoid rank if not applicable

c780074

Fix load_best_model_at_end, update training script

38732c6

Fix Usage snippet

2568187

Add get_config_dict for BCE loss

8b7a22a

Add get_config_dict to all relevant new CE losses

d8ef778

Lots of progress on loss/training overview docs

707305a

Add updated msmarco scripts that perform quite well

4f52180

Add MarginMSE from Hofstätter

089bf87

Move around docstrings for Evaluators slightly

4c4ce13

Add new valid lael columns, add dataclass decorator.

992f59b

Add ListNetLoss.

cb986f2

Add ListNetLoss, new listwise loss.

e04f06f

Add example script for ListNetLoss.

bbbb5e1

yjoonjang mentioned this pull request Mar 17, 2025

Add Position-Aware ListMLELoss tomaarsen/sentence-transformers#6

Merged

tomaarsen added 3 commits March 17, 2025 14:09

Patch lambda pre-processed script

f984e3a

Fix incorrect sampling_strategy comment

fbf188d

Fix correct output dtype of CE evaluators

bd86a28

tomaarsen mentioned this pull request Mar 19, 2025

Fix JSON Serialization Error in TrainerState due to np.float32 #3250 #3251

Closed

yjoonjang mentioned this pull request Mar 20, 2025

Add RankNetLoss tomaarsen/sentence-transformers#7

Merged

tomaarsen and others added 4 commits March 20, 2025 14:28

Remove accidentally committed training script updates

66dbff4

Rename docs model

9844fbd

Merge branch 'master' of https://github.com/UKPLab/sentence-transformers

dee6f9a

into feat/cross_encoder_trainer

tomaarsen mentioned this pull request Mar 20, 2025

feat: add 'Path' parameter for ModelCard template #3253

Merged

tomaarsen added 3 commits March 20, 2025 17:58

Update ranking according to https://auai.org/uai2014/proceedings/indi…

c0c6da0

…viduals/164.pdf and https://arxiv.org/abs/2005.10084

Allow predict_example to work with str query, list answers

152fbcd

Add version tracking to CE to suggest upgrading versions

0cf5250

Also fix version comparison for ST - I can't believe it was doing greater-than on strings for so long

tomaarsen mentioned this pull request Mar 21, 2025

Additional Trainer Argument for features of different modalities #3225

Open

tomaarsen marked this pull request as ready for review March 21, 2025 12:31

tomaarsen added 5 commits March 24, 2025 10:12

Merge branch 'master' into feat/cross_encoder_trainer

5af1cb0

Use activation_fn instead of activation_fct consistently

9111f83

Add a migration guide for v2.x -> v3.x and v3.x -> v4.x

98d686b

Slightly update extensive training script

ab8b5d8

Update sbert.net index with CE mentions

447296d

tomaarsen changed the title ~~[feat] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc.~~ [v4] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. Mar 25, 2025

Patch links in pretrained CE models

5973cb5

tomaarsen merged commit b3a3ecf into UKPLab:master Mar 25, 2025
9 checks passed

tomaarsen mentioned this pull request Mar 25, 2025

how can i save fine_tuned cross-encoder to HF and then download it from HF #2499

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[`v4`] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

[`v4`] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

Uh oh!

tomaarsen commented Feb 7, 2025 •

edited

Loading

Uh oh!

milistu commented Feb 15, 2025

Uh oh!

tomaarsen commented Feb 16, 2025 •

edited

Loading

Uh oh!

tomaarsen commented Feb 17, 2025

Uh oh!

milistu commented Feb 17, 2025

Uh oh!

tomaarsen commented Feb 17, 2025 •

edited

Loading

Uh oh!

yjoonjang commented Mar 17, 2025 •

edited

Loading

Uh oh!

yjoonjang commented Mar 19, 2025

Uh oh!

tomaarsen commented Mar 19, 2025

Uh oh!

tomaarsen commented Mar 25, 2025

Uh oh!

Uh oh!

Uh oh!

[v4] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

[v4] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

Uh oh!

Conversation

tomaarsen commented Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request overview

TODOs

Details

Uh oh!

milistu commented Feb 15, 2025

Changes

🔗 Link to My Branch

Uh oh!

tomaarsen commented Feb 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomaarsen commented Feb 17, 2025

Uh oh!

milistu commented Feb 17, 2025

Trained Model

Observations on Loss

Issue with Missing Values in Evaluation CSV

Uh oh!

tomaarsen commented Feb 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yjoonjang commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

About ListMLELoss

PR

Uh oh!

yjoonjang commented Mar 19, 2025

My suggestion

Uh oh!

tomaarsen commented Mar 19, 2025

Uh oh!

tomaarsen commented Mar 25, 2025

Uh oh!

Uh oh!

Uh oh!

[`v4`] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

[`v4`] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. #3222

tomaarsen commented Feb 7, 2025 •

edited

Loading

tomaarsen commented Feb 16, 2025 •

edited

Loading

tomaarsen commented Feb 17, 2025 •

edited

Loading

yjoonjang commented Mar 17, 2025 •

edited

Loading