
Conversation

tomaarsen
Collaborator

@tomaarsen tomaarsen commented Feb 7, 2025

Hello!

Pull Request overview

  • Introduce a new training loop, using a CrossEncoderTrainer, CrossEncoderTrainingArguments, and loss functions. Brings features such as:
    • Multi-GPU training (DP, DDP), multi-dataset training, bf16, loss logging (Weights & Biases, Tensorboard, the terminal, etc.), many additional hyperparameters, automatic model card generation, etc.
  • A new MultipleNegativesRankingLoss and CachedMultipleNegativesRankingLoss (a.k.a. InfoNCE and InfoNCE with GradCache) to train with anchor-positive pairs, anchor-positive-negative triplets, and anchor-positive-negative1-...-negativeN tuples.
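As a quick taste, here is a minimal sketch of training with the new MultipleNegativesRankingLoss on (anchor, positive) pairs. The import paths are assumed to mirror those in the full example script further below, and the data is made up:

from datasets import Dataset

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.losses.MultipleNegativesRankingLoss import MultipleNegativesRankingLoss
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer

# A reranker outputs one relevance score per (query, passage) pair
model = CrossEncoder("distilroberta-base", num_labels=1)

# The first column is treated as the anchor, the second as the positive;
# additional negatives are drawn from the other samples in the batch
train_dataset = Dataset.from_dict({
    "anchor": ["What are pandas?", "What is the capital of France?"],
    "positive": ["Pandas are a kind of bear.", "The capital of France is Paris."],
})
loss = MultipleNegativesRankingLoss(model)

trainer = CrossEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()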

TODOs

  • Additional documentation
  • CE Tests - training & inference
  • Implementing more losses from the literature (can also be done after the initial release)
    • MarginMSE
  • Determine solid defaults for MNRL and CMNRL
    • Sigmoid with scale=10.0 seems good enough for now
  • Test MSELoss - it has never been run so far
  • Update existing training scripts to the new flow
    • Remove old scripts in the original style
  • Rename fit to old_fit and create a new fit method that depends on CrossEncoderTrainer. Goal: no real backwards incompatibility with existing training scripts.
  • Add a CE NanoBEIR evaluator

Details

Overall, the goal of this refactor is to introduce feature parity between the Cross Encoder training and the Sentence Transformer training. Luckily, the work done for the ST trainer can be extended rather easily, so the refactor is not as big as it was for the SentenceTransformer class in v3.0.

Notably, training now centers around:

  1. a training Dataset or DatasetDict. This class is much more suited for sharing & efficient modifications than lists/DataLoaders of InputExample instances. A Dataset can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns. The names of these columns are irrelevant at this time. If there is a "label" column, it is treated separately, and used as the labels during training.
    A DatasetDict can be used to train with multiple datasets at once, e.g.:
    DatasetDict({
        multi_nli: Dataset({
            features: ['premise', 'hypothesis', 'label'],
            num_rows: 392702
        })
        snli: Dataset({
            features: ['snli_premise', 'hypothesis', 'label'],
            num_rows: 549367
        })
        stsb: Dataset({
            features: ['sentence1', 'sentence2', 'label'],
            num_rows: 5749
        })
    })
    When a DatasetDict is used, the loss parameter to the CrossEncoderTrainer must also be a dictionary with these dataset keys, e.g.:
    {
        'multi_nli': BinaryCrossEntropyLoss(...),
        'snli': BinaryCrossEntropyLoss(...),
        'stsb': CrossEntropyLoss(...),
    }
    By default, these are sampled from in proportion to their sizes; see the multi-dataset sketch right after this list.
  2. A loss function, or a dictionary of loss functions as described above. We now support highly customizable losses, much more than before. For example, the loss can now make pairs dynamically, as it receives "raw" texts rather than pretokenized inputs.
  3. A CrossEncoderTrainingArguments instance, a subclass of SentenceTransformerTrainingArguments. This powerful class controls the specific details of the training.
  4. An optional SentenceEvaluator instance. These instances either return a float, or a dictionary with metric keys and values. If the latter, the class must also define evaluator.primary_metric so that e.g. the "best model" checkpointing can be based on an evaluator score.
    Models can now be evaluated both on an evaluation dataset with some loss function and/or a SentenceEvaluator instance.
  5. The new CrossEncoderTrainer instance. This instance is provided with a CrossEncoder model, a CrossEncoderTrainingArguments class, a SentenceEvaluator, a training and evaluation Dataset/DatasetDict, and a loss function or dict of loss functions. Most of these parameters are optional. Once provided, all you have to do is call train().
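For instance, here is a minimal multi-dataset sketch with toy data. This is just an illustration: the BinaryCrossEntropyLoss import path is an assumption mirroring the CrossEntropyLoss import in the full script below, and the datasets are made up.

from datasets import Dataset, DatasetDict

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.losses.BinaryCrossEntropyLoss import BinaryCrossEntropyLoss
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer

model = CrossEncoder("distilroberta-base", num_labels=1)

# Two datasets with differing column names; the "label" column is treated separately
train_dataset = DatasetDict({
    "duplicates": Dataset.from_dict({
        "sentence1": ["What is the capital of France?", "How do planes fly?"],
        "sentence2": ["Which city is France's capital?", "Why do planes crash?"],
        "label": [1, 0],
    }),
    "nli": Dataset.from_dict({
        "premise": ["A man is eating food."],
        "hypothesis": ["A man eats something."],
        "label": [1],
    }),
})
# One loss per dataset key; both losses share the same model
losses = {
    "duplicates": BinaryCrossEntropyLoss(model),
    "nli": BinaryCrossEntropyLoss(model),
}

trainer = CrossEncoderTrainer(model=model, train_dataset=train_dataset, loss=losses)
trainer.train()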

This is an example of an extensive training script with all of the features at play:

import logging
from datetime import datetime

from datasets import load_dataset

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEClassificationEvaluator
from sentence_transformers.cross_encoder.losses.CrossEntropyLoss import CrossEntropyLoss
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer
from sentence_transformers.cross_encoder.training_args import CrossEncoderTrainingArguments

# Set the log level to INFO to get more information
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

train_batch_size = 64
num_epochs = 1
output_dir = "output/training_ce_allnli-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# 1. Define our CrossEncoder model. We use distilroberta-base as the base model and set it up to predict 3 labels
# You can also use other base models, like bert-base-uncased, microsoft/mpnet-base, etc.
model = CrossEncoder("distilroberta-base", num_labels=3)

# 2. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli
# We'll start with 10k training samples, but you can increase this to get a stronger model
logging.info("Read AllNLI train dataset")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="train").select(range(10000))
eval_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev").select(range(1000))
test_dataset = load_dataset("sentence-transformers/all-nli", "pair-class", split="test")
logging.info(train_dataset)

# We might want to remap labels from the dataset; you can do that like so:
mapping = {0: 1, 1: 2, 2: 0}
eval_dataset = eval_dataset.map(lambda x: {"label": mapping[x["label"]]})
test_dataset = test_dataset.map(lambda x: {"label": mapping[x["label"]]})

# 3. Define our training loss:
loss = CrossEntropyLoss(model)

# 4. During training, we use CEClassificationEvaluator to measure the performance on the dev set
dev_cls_evaluator = CEClassificationEvaluator(
    list(zip(eval_dataset["premise"], eval_dataset["hypothesis"])),
    eval_dataset["label"],
    name="AllNLI-dev",
)
dev_cls_evaluator(model)

# 5. Define the training arguments
args = CrossEncoderTrainingArguments(
    # Required parameter:
    output_dir=output_dir,
    # Optional training parameters:
    num_train_epochs=num_epochs,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=train_batch_size,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    run_name="ce-nli-v1",  # Will be used in W&B if `wandb` is installed
)

# 6. Create the trainer & start training
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_cls_evaluator,
)
trainer.train()

# 7. Evaluate the final model on test dataset
test_cls_evaluator = CEClassificationEvaluator(
    list(zip(test_dataset["premise"], test_dataset["hypothesis"])),
    test_dataset["label"],
    name="AllNLI-test",
)
test_cls_evaluator(model)

# 8. Save the final model
final_output_dir = f"{output_dir}/final"
model.save_pretrained(final_output_dir)

As you may note, it is very similar to the new SentenceTransformer flow: a datasets Dataset, a standalone loss with much more flexibility than before, TrainingArguments and Trainer classes, and Evaluators much like before, as used in SentenceTransformer training.

cc @milistu as you're also working on CrossEncoders
cc @LysandreJik

  • Tom Aarsen

@milistu
Contributor

milistu commented Feb 15, 2025

Hi @tomaarsen

I've used your fork and branch as a base and added my implementation for ListNet Loss.

Changes

  • Updated CrossEncoderDataCollator

    • Added new values to valid_label_columns.
    • Decorated CrossEncoderDataCollator with @dataclass to ensure changes take effect correctly. Previously, it was using variables from SentenceTransformerDataCollator even after modification. Now everything works as expected.
  • Implemented ListNetLoss

    • Added ListNetLoss implementation.
    • Updated __init__.py to include the new loss function.
  • Added an example script

    • Demonstrates how to prepare a dataset and use ListNetLoss with the MS MARCO dataset.

🔗 Link to My Branch

feat/cross_encoder_trainer

Let me know what you think or if any modifications are needed!

@tomaarsen
Collaborator Author

tomaarsen commented Feb 16, 2025

Hello!

This is excellent work, looks very solid! Have you been able to run the training script yourself so far? I can also try and run it and upload the finished model.
Edit: I trained one: https://huggingface.co/tomaarsen/reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-listnet

In the coming days I can try and merge your work into this PR.

  • Tom Aarsen

@tomaarsen
Collaborator Author

It's interesting to see that although the model does get better than the BM25 baseline, the loss effectively does not change.

@milistu
Copy link
Contributor

milistu commented Feb 17, 2025

Hi @tomaarsen 👋

I successfully trained the model and experimented with hyperparameters.

Trained Model

You can find the trained model here:
Studeni/reranker-msmarco-v1.1-ModernBERT-base-listnet

Observations on Loss

I noticed that the loss is slightly higher (~2.0). Through research and testing, I found that this discrepancy arises due to differences in distribution:

  • Ground truth labels are binary, which results in a sharp distribution.
  • Our CrossEncoder outputs continuous scores, leading to a softer distribution compared to the ground truth.
  • This mismatch explains the higher loss value.

One possible solution is to apply a transformation to the predicted distribution to better align it with the ground truth. However, for now, I think this is sufficient. Instead of refining this approach further, I’d prefer to integrate more listwise loss functions that are known to outperform ListNet.

Additionally, I assume that combining MSE and ListNet loss could yield better results by leveraging the strengths of both approaches. I can explore this further.

Issue with Missing Values in Evaluation CSV

While training, I noticed missing values in the evaluation CSVs.

Example from CERerankingEvaluator_NanoMSMARCO_results_@10.csv:

epoch,steps,MAP,MRR@10,NDCG@10
0.17214666896195557,2000,0.04166666666666666,0.0654804137172179
0.34429333792391115,4000,0.2629365079365079,0.32231584858096596
0.5164400068858668,6000,0.417079365079365,0.4742922970864457
0.6885866758478223,8000,0.45016666666666666,0.5006481726476146
0.8607333448097779,10000,0.49913492063492065,0.5646165118665413
1.0328800137717336,12000,0.49426984126984125,0.5609090407025468
1.205026682733689,14000,0.4636349206349206,0.5413296770100868
1.3771733516956446,16000,0.4563809523809524,0.5233662617261322

Here, we expect five columns, but only four values appear in some rows.
Initially, I suspected that NDCG was missing, but after further inspection, I believe MAP is missing. This issue occurs in all evaluation CSVs.

Has this happened in your training as well? I used the same (or a very similar) setup from your MS MARCO training example.

@tomaarsen
Collaborator Author

tomaarsen commented Feb 17, 2025

I'll investigate the CSV issue, that one is definitely on me.

Beyond that, I've implemented an activation_fct parameter on all losses that post-processes the logits, to try and help with the distributions. For example, with MultipleNegativesRankingLoss (a.k.a. InfoNCE loss) I use Tanh, based on https://arxiv.org/abs/2407.19669, to map the predictions to scores between -1 and 1 before using them in the cross-entropy loss. Perhaps something like that is required for ListNetLoss as well?
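For instance, a minimal sketch of the Tanh variant (assuming the MultipleNegativesRankingLoss import path mirrors the other CE losses):

from torch import nn

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.losses.MultipleNegativesRankingLoss import MultipleNegativesRankingLoss

model = CrossEncoder("distilroberta-base", num_labels=1)
# Map the raw logits into [-1, 1] before they are used in the cross-entropy computation
loss = MultipleNegativesRankingLoss(model, activation_fct=nn.Tanh())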

I'm definitely open to other Listwise loss implementations! I'm currently looking into improving mine_hard_negatives so that it can be used to generate training and evaluation data with larger datasets.

  • Tom Aarsen

@yjoonjang
Contributor

yjoonjang commented Mar 17, 2025

Hi @milistu, @tomaarsen
I don't know if this is the right way to contribute, but I've implemented ListMLELoss on @tomaarsen's cross_encoder_trainer branch.

About ListMLELoss

ListMLE is a listwise learning-to-rank loss function that directly optimizes the likelihood of the correct permutation of documents. It models the probability of a permutation using the Plackett-Luce model, which sequentially selects items based on their scores; see the formulation sketched after the list below.
The key difference between ListMLE and ListNet is in how they model the ranking problem:

  • ListNet uses a softmax to convert scores to probability distributions and minimizes the cross-entropy between the predicted distribution and the ground truth distribution. It focuses on the probability of items being ranked at the top position.
  • ListMLE directly models the probability of the entire permutation and maximizes the likelihood of the correct ordering. It captures the sequential nature of the ranking process, where each position is filled one by one.
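Concretely, for predicted scores s and a ground-truth permutation π over n documents, ListMLE minimizes the negative Plackett-Luce log-likelihood (the standard formulation, in LaTeX):

\mathcal{L}_{\text{ListMLE}} = -\log P(\pi \mid s) = -\sum_{i=1}^{n} \log \frac{\exp(s_{\pi(i)})}{\sum_{j=i}^{n} \exp(s_{\pi(j)})}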

Additionally, I've implemented Position-Aware ListMLE with lambda weighting, which applies different weights to different rank positions. This allows the model to focus more on getting the top positions correct, which is often more important in ranking tasks.

This loss is particularly valuable when dealing with multiple relevant documents that have a clear preference order. For example, when training a reranker for tool selection, some tools should be ranked higher than others for a given query, even though both are relevant. ListMLE can effectively learn these nuanced preferences by modeling the entire permutation probability, ensuring that the most suitable tool appears first in the ranking, followed by the second-best option, and so on.

PR

tomaarsen#6

  • Youngjoon Jang

* Add ListMLELoss

* Fix input_order not being considered

* Update init.py

* Add training scripts for ListMLELoss

* Fix self.lambda_weight to ListMLELambdaWeight

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* Refactor conditional logic in ListMLELambdaWeight

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* Fix Delete unused function - create_p_list_mle_lambda_weight

* Refactor mask creation using zeros_like

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* Refactor for-loop with vectorized operations when applying position weights

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* Fix reference and citation

* Refactor to seperate PListMLE and ListMLE

* Refactor training scripts for PListMLE and ListMLE

* Add information of data to be sorted in a defined rank order

* Run formatting

* Ensure that paddings are excluded in the loss

* Remove lambda_weight as an option from ListMLELoss

* Update documentation throughout

* Update MS MARCO training examples docs

* Fix anecdotal ranking

* Vectorize the lambda weight computation

It should be equivalent, and considerably faster (although it wasn't necessarily a bottleneck)

* Add +1 to PListMLELambdaWeight; normalize weight by divide-by-sum

* Simplify code, remove duplicate +1

* Update get_config_dict for ListMLELoss to remove lambda_weight

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
@yjoonjang
Contributor

Hi @tomaarsen, @milistu

I was wondering if you would also like to implement RankNetLoss (a.k.a. pairwise logistic loss).
This loss represents the pairwise approach to learning to rank, and is also used in RankLLM (https://github.com/castorini/rank_llm).

I've actually worked on the code and did some experiments about activation functions.

Code:
from __future__ import annotations

from typing import Literal

import torch
from torch import Tensor, nn

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.util import fullname


class RankNetLoss(nn.Module):
    def __init__(
        self,
        model: CrossEncoder,
        sigma: float = 1.0,
        eps: float = 1e-10,
        activation_fct: nn.Module | None = nn.Identity(),
        mini_batch_size: int | None = None,
    ) -> None:
        """
        RankNet loss implementation for learning to rank. This loss function implements the RankNet algorithm,
        which learns a ranking function by optimizing pairwise document comparisons using a neural network.
        The implementation is optimized to handle padded documents efficiently by only processing valid
        documents during model inference.

        Args:
            model (CrossEncoder): CrossEncoder model to be trained
            sigma (float): Score difference weight used in sigmoid (default: 1.0)
            eps (float): Small constant for numerical stability (default: 1e-10)
            activation_fct (:class:`~torch.nn.Module`): Activation function applied to the logits before computing the
                loss. Defaults to :class:`~torch.nn.Identity`.
            mini_batch_size (int, optional): Number of samples to process in each forward pass. This has a significant
                impact on the memory consumption and speed of the training process. Three cases are possible:
                - If ``mini_batch_size`` is None, the ``mini_batch_size`` is set to the batch size.
                - If ``mini_batch_size`` is greater than 0, the batch is split into mini-batches of size ``mini_batch_size``.
                - If ``mini_batch_size`` is <= 0, the entire batch is processed at once.
                Defaults to None.

        References:
            - Learning to Rank using Gradient Descent: https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf

        Requirements:
            1. Query with multiple documents (pairwise approach)
            2. Documents must have relevance scores/labels. Both binary and continuous labels are supported.

        Inputs:
            +----------------------------------------+--------------------------------+-------------------------------+
            | Texts                                  | Labels                         | Number of Model Output Labels |
            +========================================+================================+===============================+
            | (query, [doc1, doc2, ..., docN])       | [score1, score2, ..., scoreN]  | 1                             |
            +----------------------------------------+--------------------------------+-------------------------------+

        Example:
            ::

                from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
                from datasets import Dataset

                model = CrossEncoder("microsoft/mpnet-base")
                train_dataset = Dataset.from_dict({
                    "query": ["What are pandas?", "What is the capital of France?"],
                    "docs": [
                        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
                        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
                    ],
                    "labels": [[1, 0], [1, 1, 0]],
                })
                loss = losses.RankNetLoss(model)

                trainer = CrossEncoderTrainer(
                    model=model,
                    train_dataset=train_dataset,
                    loss=loss,
                )
                trainer.train()
        """
        super().__init__()
        self.model = model
        self.sigma = sigma
        self.eps = eps
        self.activation_fct = activation_fct or nn.Identity()
        self.mini_batch_size = mini_batch_size

        if self.model.num_labels != 1:
            raise ValueError(
                f"{self.__class__.__name__} supports a model with 1 output label, "
                f"but got a model with {self.model.num_labels} output labels."
            )

    def forward(self, inputs: tuple[list[str], list[list[str]]], labels: list[Tensor]) -> Tensor:
        """
        Compute RankNet loss for a batch of queries and their documents.

        Args:
            inputs: List of (queries, documents_list)
            labels: Ground truth relevance scores, shape (batch_size, num_documents)

        Returns:
            Tensor: RankNet loss over the batch
        """
        if isinstance(labels, Tensor):
            raise ValueError(
                "RankNetLoss expects a list of labels for each sample, but got a single value for each sample."
            )
        if len(inputs) != 2:
            raise ValueError(f"RankNetLoss expects two inputs (queries, documents_list), but got {len(inputs)} inputs.")

        queries, docs_list = inputs
        docs_per_query = [len(docs) for docs in docs_list]
        max_docs = max(docs_per_query)
        batch_size = len(queries)

        if docs_per_query != [len(lbls) for lbls in labels]:
            raise ValueError(
                f"Number of documents per query in inputs ({docs_per_query}) does not match number of labels per query ({[len(lbls) for lbls in labels]})."
            )

        # Create input pairs for the model
        pairs = [(query, document) for query, docs in zip(queries, docs_list) for document in docs]

        if not pairs:
            # Handle edge case where all documents are padded
            return torch.tensor(0.0, device=self.model.device, requires_grad=True)

        mini_batch_size = self.mini_batch_size or batch_size
        if mini_batch_size <= 0:
            mini_batch_size = len(pairs)

        logits_list = []
        for i in range(0, len(pairs), mini_batch_size):
            mini_batch_pairs = pairs[i : i + mini_batch_size]

            tokens = self.model.tokenizer(
                mini_batch_pairs,
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            tokens = tokens.to(self.model.device)

            logits = self.model(**tokens)[0].view(-1)
            logits_list.append(logits)

        logits = torch.cat(logits_list, dim=0)
        logits = self.activation_fct(logits)

        # Create a logits matrix filled with a large negative value; padded positions are ignored via the labels mask
        logits_matrix = torch.full((batch_size, max_docs), -1e16, device=self.model.device)

        # Place logits in the desired positions in the logit matrix
        doc_indices = torch.cat([torch.arange(len(docs)) for docs in docs_list], dim=0)
        batch_indices = torch.repeat_interleave(torch.arange(batch_size), torch.tensor(docs_per_query))
        logits_matrix[batch_indices, doc_indices] = logits

        # Create labels matrix
        labels_matrix = torch.full_like(logits_matrix, float("-inf"))
        labels_matrix[batch_indices, doc_indices] = torch.cat(labels, dim=0).float()
        labels_matrix = labels_matrix.to(self.model.device)

        # Calculate pairwise differences for scores and labels
        score_diffs = logits_matrix[:, :, None] - logits_matrix[:, None, :]
        label_diffs = labels_matrix[:, :, None] - labels_matrix[:, None, :]

        # Create mask for valid pairs (where both documents are not padded)
        valid_pairs = torch.isfinite(label_diffs)
        
        # Create mask for pairs where l_i > l_j
        positive_pairs = label_diffs > 0

        # Calculate probabilities and target probabilities
        P_ij = torch.sigmoid(self.sigma * score_diffs)
        P_ij = torch.clamp(P_ij, min=self.eps, max=1 - self.eps)
        
        # Calculate loss only for pairs where l_i > l_j (positive_pairs)
        losses = -torch.log(P_ij)

        # Apply masks and compute mean loss
        masked_loss = losses[valid_pairs & positive_pairs]

        # Handle case when there are no positive pairs
        if masked_loss.numel() == 0:
            return torch.tensor(0.0, device=self.model.device, requires_grad=True)

        loss = torch.mean(masked_loss)

        return loss

    def get_config_dict(self) -> dict[str, float | int | str | None]:
        """
        Get configuration parameters for this loss function.

        Returns:
            Dictionary containing the configuration parameters
        """
        return {
            "sigma": self.sigma,
            "eps": self.eps,
            "activation_fct": fullname(self.activation_fct),
            "mini_batch_size": self.mini_batch_size,
        }

    @property
    def citation(self) -> str:
        return """
@inproceedings{burges2005learning,
  title={Learning to rank using gradient descent},
  author={Burges, Chris and Shaked, Tal and Renshaw, Erin and Lazier, Ari and Deeds, Matt and Hamilton, Nicole and Hullender, Greg},
  booktitle={Proceedings of the 22nd international conference on Machine learning},
  pages={89--96},
  year={2005}
}
""" 

The result:
[image: experiment results for the tested activation functions]

In short, the best recipe for training the reranker model was simply using the Identity function.

Metric   Value
map      0.4775 (+0.0874)
mrr@10   0.5597 (+0.0917)
ndcg@10  0.5298 (+0.0745)

However, looking at LambdaLoss, I found that RankNetLoss is the same as LambdaLoss when using NoWeightingScheme.

So I trained the model with

loss = LambdaLoss(
    model=model,
    weighting_scheme=NoWeightingScheme(),
    mini_batch_size=mini_batch_size,
)

And the result was:

Metric   Value
map      0.4633 (+0.0733)
mrr@10   0.5471 (+0.0791)
ndcg@10  0.5183 (+0.0629)

My suggestion

So if you would like to implement RankNetLoss, there are two approaches we could take.

  1. Implementing the code above (or dropping the logging of weighted_probas when using RankNetLoss).
  2. Just initializing LambdaLoss with NoWeightingScheme. This would be similar to ListMLE subclassing PListMLE, which @tomaarsen and I discussed before (Add Position-Aware ListMLELoss tomaarsen/sentence-transformers#6 (comment)).

What are your thoughts?

@tomaarsen
Collaborator Author

I'm definitely interested - I think it might make the most sense to implement it as a subclass of LambdaLoss!
As for our msmarco training scripts - I think we should try and add some seeding to the initialization, so we can more easily use those scripts to compare training performance (albeit with just a sample size of 1).
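A minimal sketch of that subclassing approach (hypothetical; the merged implementation may differ, and it assumes LambdaLoss accepts the same sigma, eps, activation_fct, and mini_batch_size parameters as the standalone code above):

from __future__ import annotations

from torch import nn

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.losses import LambdaLoss, NoWeightingScheme


class RankNetLoss(LambdaLoss):
    """RankNet as the special case of LambdaLoss with no lambda weighting."""

    def __init__(
        self,
        model: CrossEncoder,
        sigma: float = 1.0,
        eps: float = 1e-10,
        activation_fct: nn.Module | None = nn.Identity(),
        mini_batch_size: int | None = None,
    ) -> None:
        # NoWeightingScheme reproduces plain RankNet, as shown above in the thread
        super().__init__(
            model=model,
            weighting_scheme=NoWeightingScheme(),
            sigma=sigma,
            eps=eps,
            activation_fct=activation_fct,
            mini_batch_size=mini_batch_size,
        )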

tomaarsen and others added 4 commits March 20, 2025 14:28
* Add RankNetLoss and training script

* Fix ListMLELoss documentation

* Fix RankNet to class LambdaLoss

* Update training script for RankNetLoss

* Use super().get_config_dict() and remove weighting scheme

It's a bit confusing to include the weighting scheme in the config if the RankNet loss doesn't have a notion of that

* Add to __init__.py for easier import

* Correctly capitalize citation titles

* Introduce reproducibility for the msmarco scripts

* Add more docs for RankNetLoss

* Add RankNet to Loss Overview & API Reference

* Expand on RankNet docs slightly

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
@tomaarsen tomaarsen changed the title [feat] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. [v4] CrossEncoder Training refactor - MultiGPU, loss logging, bf16, etc. Mar 25, 2025
@tomaarsen
Collaborator Author

In conclusion:

  • Cross Encoder (i.e. reranker) refactor:
    • New Trainer, Training Arguments, Data Collator
    • 11 new losses
    • 1 new, 3 refactored, 6 deprecated evaluators
    • Model card generation
    • 100% backwards compatibility with old training (model.fit)
  • Tests:
    • 83 tests for CE Loading, inference, training, etc.
  • Docs:
    • All new Training Overview, Loss Overview, API Reference docs
    • 5 new, 1 refactored training examples docs pages
    • 13 new, 6 refactored training scripts
    • Migration guide (2.x -> 3.x, 3.x -> 4.x)

Blogpost coming on release day.

Big thanks to @milistu and @yjoonjang for their huge roles in the learning-to-rank losses.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit b3a3ecf into UKPLab:master Mar 25, 2025
9 checks passed
Linked issue: Cross-Encoders training seed