
Conversation

Samoed
Member

@Samoed Samoed commented Aug 2, 2025

Closes #2728

Added this interface to replace DenseRetrievalExactSearch:

from typing import Any, Protocol

# CorpusDatasetType, QueryDatasetType, InstructionDatasetType,
# TopRankedDocumentsType, RetrievalOutputType, and TaskMetadata
# are mteb's own type aliases.


class SearchInterface(Protocol):
    """Interface for searching models."""

    def index(
        self,
        corpus: CorpusDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        encode_kwargs: dict[str, Any],
    ) -> None:
        ...

    def search(
        self,
        queries: QueryDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        top_k: int,
        encode_kwargs: dict[str, Any],
        instructions: InstructionDatasetType | None = None,
        top_ranked: TopRankedDocumentsType | None = None,
    ) -> RetrievalOutputType:
        ...
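To make the shape of the protocol concrete, here is a toy brute-force searcher with the same `index`/`search` split. Everything in it is an illustrative assumption, not the PR's actual wrapper: the class name `ToyExactSearch`, the bag-of-words stand-in for an encoder, and the plain-dict corpus/query types standing in for mteb's `CorpusDatasetType`/`QueryDatasetType`.

```python
from collections import Counter


class ToyExactSearch:
    """Toy sketch matching the shape of SearchInterface (not mteb's code)."""

    def __init__(self) -> None:
        self.doc_vecs: dict[str, Counter] = {}

    def index(self, corpus: dict[str, str], **kwargs) -> None:
        # Encode and store the whole corpus up front.
        self.doc_vecs = {doc_id: self._embed(text) for doc_id, text in corpus.items()}

    def search(
        self, queries: dict[str, str], *, top_k: int, **kwargs
    ) -> dict[str, dict[str, float]]:
        # Score every document for every query and keep only the top_k.
        results: dict[str, dict[str, float]] = {}
        for qid, text in queries.items():
            q_vec = self._embed(text)
            scores = {
                doc_id: float(sum(q_vec[t] * d_vec[t] for t in q_vec))
                for doc_id, d_vec in self.doc_vecs.items()
            }
            top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            results[qid] = dict(top)
        return results

    @staticmethod
    def _embed(text: str) -> Counter:
        # Stand-in for a real encoder: a bag-of-words count vector.
        return Counter(text.lower().split())
```

The point of the two-step shape is that `index` can do the expensive corpus pass once, while repeated `search` calls only encode queries.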

@Samoed Samoed requested a review from Copilot August 5, 2025 06:16

@Copilot Copilot AI left a comment


Pull Request Overview

This PR integrates a new search interface to replace the existing DenseRetrievalExactSearch approach. It introduces a unified SearchInterface protocol that supports both encoder-based and cross-encoder models for retrieval tasks, providing a more flexible and extensible architecture.

Key changes:

  • Introduces SearchProtocol and CrossEncoderProtocol interfaces for standardized model interaction
  • Refactors retrieval evaluation to use wrapper classes (SearchEncoderWrapper, SearchCrossEncoderWrapper)
  • Consolidates model protocols into a centralized models_protocols.py file

Reviewed Changes

Copilot reviewed 63 out of 63 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| mteb/models/models_protocols.py | Defines new search and cross-encoder protocols with standardized interfaces |
| mteb/models/search_wrappers.py | Implements wrapper classes that adapt encoders and cross-encoders to the search interface |
| mteb/evaluation/evaluators/retrieval_evaluator.py | Refactors to use the new search interface instead of DenseRetrievalExactSearch |
| mteb/models/sentence_transformer_wrapper.py | Renames classes and separates cross-encoder functionality |
| Various model implementation files | Updates import paths and class names to use new wrapper naming |

@Samoed
Member Author

Samoed commented Aug 5, 2025

I've implemented a search wrapper for the cross-encoder to support only reranking tasks. I think the v2 implementation of cross-encoder search is a bit incorrect. In this part, we're selecting all corpus IDs, and I believe it always selects the same IDs every time:

    q_results = dict.fromkeys(corpus["id"], 0.0)
else:
    q_results = self.previous_results[qid]
# take the top-k only
q_results_sorted = dict(
    sorted(q_results.items(), key=lambda item: item[1], reverse=True)
)
top_n = [k for k, v in list(q_results_sorted.items())[:top_k]]
query = row["text"]

I think we can change this to encode all pairs from qrels instead.
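The qrels-based alternative could look roughly like this sketch (the dict layouts for `qrels`, `queries`, and `corpus` below are assumptions for illustration, not mteb's actual dataset types):

```python
# Build cross-encoder input pairs directly from the judged (query, document)
# pairs in qrels, instead of scoring the whole corpus for every query.
def pairs_from_qrels(
    qrels: dict[str, dict[str, int]],
    queries: dict[str, str],
    corpus: dict[str, str],
) -> list[tuple[str, str]]:
    """Return (query_text, doc_text) pairs for every judged pair."""
    return [
        (queries[qid], corpus[doc_id])
        for qid, judged in qrels.items()
        for doc_id in judged
    ]
```

The cross-encoder would then score only these pairs, so the cost scales with the number of judgments rather than with |queries| x |corpus|.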

For now, I think we don't want to support two-stage encoding. We can add it later.

@Samoed
Member Author

Samoed commented Aug 5, 2025

Also, I'm not sure we need the top_ranked split if it has the same logic and values as the qrels (relevant_document) split. It's used only to get the corpus IDs for a query, but I think this can be done with relevant_document too.

if query_id not in top_ranked:
    logger.warning(f"No pre-ranked documents found for query {query_id}")
    continue
ranked_ids = top_ranked[query_id]

I've looked at some datasets, and they have the same information in qrels and top_ranked.

CC @orionw

@orionw
Contributor

orionw commented Aug 7, 2025

Also, I'm not sure we need the top_ranked split if it has the same logic and values as the qrels (relevant_document) split. It's used only to get the corpus IDs for a query, but I think this can be done with relevant_document too.

For now, I think we don't want to support two-stage encoding. We can add it later.

I think these are related. top_ranked is simulating the 2nd-stage pass, where your first stage has found some documents but you aren't sure whether they're relevant or not. If you just used the qrels, then for some datasets you'd be limited to reranking documents that have been judged (which in some cases is just the relevant docs).

My guess is there are some datasets in MTEB for reranking that have nearly identical qrels and top_ranked, but there are other datasets where that doesn't hold (and/or future datasets/flexibility).

I've implemented a search wrapper for the cross-encoder to support only reranking tasks. I think the v2 implementation of cross-encoder search is a bit incorrect. In this part, we're selecting all corpus IDs, and I believe it always selects the same IDs every time:

We do want to only pick the ids they give for reranking, but it should differ per query. I think top_ranked is a mapping for this reason, to select the ids to use for each query.

@Samoed
Member Author

Samoed commented Aug 7, 2025

My guess is there are some datasets in MTEB for reranking that have nearly identical qrels and top_ranked, but there are other datasets where that doesn't hold (and/or future datasets/flexibility).

For now, I think they're always the same, because they were reuploaded by you.

top_ranked is simulating the 2nd stage pass where your first stage has found some documents but you aren't sure if they're relevant or not.

But this can be achieved by using qrels too.

We do want to only pick the ids they give for reranking, but it should differ per query. I think top_ranked is a mapping for this reason, to select the ids to use for each query.

I think the same information is stored in qrels.

Comment on lines +302 to +305
if top_ranked is None:
    raise ValueError(
        "CrossEncoder search requires top_ranked documents for reranking."
    )
Member Author


I've implemented cross-encoders to support reranking tasks only, but we have tests:

  1. testing cross-encoders on retrieval
  2. two-stage reranking

Do we want to support these cases?

Contributor


+1 to what @KennethEnevoldsen thinks, but the 2nd-stage reranking is nearly identical to just normal reranking and almost comes for free. I would vote to keep it if there's not much added complexity (unless things have changed and now there's more complexity).

@Samoed
Member Author

Samoed commented Aug 7, 2025

@KennethEnevoldsen What do you think about relevant_docs/top_ranked?

@orionw
Contributor

orionw commented Aug 7, 2025

But this can be achieved by using qrels too.

Not always. Maybe the easiest example is that you don't need to include non-relevant documents in qrels. So if I were to upload a new reranking dataset with only positive qrels then we'd only ever get 100% (since we're only reranking the positives with no negatives).

I think the old format used to always specify whether documents were relevant or non-relevant, and so when I uploaded them, those went into the qrels even though they normally wouldn't (but to keep scores equal with the old format they had to be).

But you don't have to specify the non-relevant documents for most retrieval/reranking datasets so forcing it to be that way would limit future datasets.

@KennethEnevoldsen
Contributor

For now, I think we don't want to support two-stage encoding. We can add it later.

We should at least have it in v2. However, we might limit the scope of this PR and just create an issue on it.

mteb.evaluate currently doesn't support it, but it should be possible to solve this using a wrapper.

I am thinking something like this:

retrieval_task = ...
model_stage1 = ...
model_stage2 = ...

model_stage1 = SavePredictionsWrapper(model_stage1, path=...)
res = mteb.evaluate(model_stage1, retrieval_task)

model_stage2 = LoadPredictions(model_stage2, path=...)
res = mteb.evaluate(model_stage2, retrieval_task)

The hope is that this would maintain a fairly simple evaluation loop and move the two-stage into a different component. WDYT?

Note: this does not support saving the errors (which is also not supported in mteb.evaluate, but that is also currently not documented anywhere).
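A minimal sketch of what such wrappers could look like. Neither `SavePredictionsWrapper` nor `LoadPredictions` exists in mteb yet; the JSON hand-off file and the `search`/`top_ranked` signatures below are assumptions for illustration only.

```python
import json
from pathlib import Path


class SavePredictionsWrapper:
    """Run stage 1 normally, persisting its results for a later stage."""

    def __init__(self, model, path: str) -> None:
        self.model, self.path = model, path

    def search(self, queries, **kwargs):
        results = self.model.search(queries, **kwargs)
        Path(self.path).write_text(json.dumps(results))  # persist the stage-1 run
        return results


class LoadPredictions:
    """Feed a saved stage-1 run to the second model as reranking candidates."""

    def __init__(self, model, path: str) -> None:
        self.model, self.path = model, path

    def search(self, queries, **kwargs):
        previous = json.loads(Path(self.path).read_text())
        top_ranked = {qid: list(docs) for qid, docs in previous.items()}
        return self.model.search(queries, top_ranked=top_ranked, **kwargs)
```

This keeps the evaluation loop unchanged: each stage is just another model from `mteb.evaluate`'s point of view.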

Regarding qrels and top_ranked:

I am not entirely sure what the right move is, but I would probably trust @orionw's intuition more than my own here. In which case I would keep them.

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

@Samoed
Member Author

Samoed commented Aug 9, 2025

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

It was only used to get (query id, doc id) pairs for reranking tasks for the encoders.

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

The hope is that this would maintain a fairly simple evaluation loop and move the two-stage into a different component. WDYT?

I'm not sure about wrappers for the models, because this problem is more related to the task than to the models, and the results of the previous run should be passed to the evaluator somehow.

Also, we might want to find a solution for saving the results of these runs, e.g. to show them on the leaderboard.

@KennethEnevoldsen
Contributor

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

I think they should. We want people to evaluate realistic use-cases. I would argue that it is more realistic than the reranking case.

(@orionw)

I'm not sure about wrappers for the models, because this problem is more related to the task than to the models, and the results of the previous run should be passed to the evaluator somehow.

Hmm I don't think it is a concern for the evaluator. The model can respond with the same documents as the previous model + its own edits. No reason for the evaluator to know what the previous model predicted.

@Samoed
Member Author

Samoed commented Aug 9, 2025

I think they should

So cross-encoders would use only qrels, or we can make them use qrels by default and top_ranked if it is passed.

Hmm I don't think it is a concern for the evaluator.

But in that case the evaluator will pass pairs of queries and documents (as in standard retrieval) to the model and will expect their scores, and I don't understand how the model would use top_ranked then.

@orionw
Contributor

orionw commented Aug 9, 2025

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

There are cases where they are different - at least all of the InstructionReranking tasks.

The somewhat long background is that for MTEB v1, the reranking tasks used to use custom metrics and contain these fields:

        query: str
        positive: list[str]
        negative: list[str]

In #1359 we consolidated retrieval/reranking so they could both use the TREC evaluator and now they share much of the same logic.

So in order to maintain the same scores as MTEB v1 reranking, we had to add all the negatives to the negative qrels. But that is not how reranking is normally done, so it's a bit of a hack. More recent and future datasets won't do it this way.

Any reranking task formulation has to have both qrels and top_ranked since they are generally quite distinct. Imagine I wanted to create a new NQ reranking task (NQ has two positives per query): NQ doesn't have qrels for documents that are non-relevant since humans only validated the positives. If I didn't upload a top_ranked, it would only rerank from the qrels, which would only rerank the two relevant documents per query. This would always achieve 100% on all metrics. The top_ranked field instead tells the model what to rerank and the qrels is what has been judged to be relevant or not.
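The NQ-style situation above can be shown with a few lines of fabricated toy data (the IDs and splits below are invented for illustration):

```python
# With qrels that contain only validated positives, reranking "from qrels"
# is degenerate; top_ranked supplies the actual first-stage candidate pool.
qrels = {"q1": {"d1": 1, "d2": 1}}             # only judged positives
top_ranked = {"q1": ["d1", "d2", "d7", "d9"]}  # positives + unjudged negatives

candidates_from_qrels = list(qrels["q1"])        # reranks only the 2 positives
candidates_from_top_ranked = top_ranked["q1"]    # a real reranking pool

# Any ordering of candidates_from_qrels scores 100% against qrels by
# construction; only candidates_from_top_ranked actually tests the model.
```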

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

Sorry, I don't think I fully follow this @Samoed. Cross-Encoders shouldn't generally do retrieval without top_ranked since they would have to rerank every combination of query and doc (very expensive), although it is technically feasible. But Embedding models could do either one and should already be able to handle both.

Also we might want to find a solution how to save results of these run, e.g. to show them on leaderboard

I do think we will have to come up with a solution for leaderboard style evaluation, but I think the wrapper and current solution works great for practical evaluation/use.

I think the wrapper style would be quite easy to use @KennethEnevoldsen!

@Samoed
Member Author

Samoed commented Aug 9, 2025

Cross-Encoders shouldn't generally do retrieval without top_ranked since they would have to rerank every combination of query and doc (very expensive), although it is technically feasible.

I agree with you, but in our tests, one of the failures in this branch is that cross-encoders are being tested on retrieval tasks (the other cases are two-stage reranking, but still on retrieval), and in the main branch we have handling of cross-encoders in retrieval (`def search_cross_encoder`), and that's why I raised this question.

I do think we will have to come up with a solution for leaderboard style evaluation, but I think the wrapper and current solution works great for practical evaluation/use.

I think we could create a wrapper for tasks to store information about top_ranked so it can be reused in the evaluator instead of in the model.

@orionw
Contributor

orionw commented Aug 9, 2025

I agree with you, but in our tests, one of the failures in this branch is that cross-encoders are being tested on retrieval tasks (the other cases are two-stage reranking, but still on retrieval), and that's why I raised this question.

Ah I see. Yeah I think we could drop that test and honestly we should print a warning to the user if they are doing a retrieval task with a cross-encoder. It's possible they mean to (and we could allow it for people who really want to or are using small datasets), but more likely they are making a mistake.

I think we could create a wrapper for tasks to store information about top_ranked so it can be reused in the evaluator instead of in the model.

Yeah, this is a nice approach! It would basically be like creating a mini-reranking task that is then reproducible. The only issue I could see is that any leaderboard would have to use the same top_ranked, but it would solve reproducibility issues, which is big.

@Samoed
Member Author

Samoed commented Aug 9, 2025

So, @orionw, can you review my implementation then? I will work on multistage in another PR.

@orionw
Contributor

orionw commented Aug 9, 2025

Yes, sorry, will review. Is it this one?

@Samoed
Member Author

Samoed commented Aug 9, 2025

Yes, it is

Contributor

@orionw orionw left a comment


LGTM, thanks for the great PR!

This looks like it maintains all functionality w.r.t. top_ranked and adds the index option so we can add more late interaction models and larger-scale dense datasets.

cc as an FYI for @NohTow who can hopefully add all the PyLate functionality soon!

@Samoed
Member Author

Samoed commented Aug 16, 2025

@KennethEnevoldsen can you review the PR and give your thoughts about two-stage reranking?

@KennethEnevoldsen
Contributor

Yeah, I generally think that this looks really good - very happy about where we are heading.

Let us do the two-stage in another PR.

I am more than happy with disabling the test here and merging

@Samoed Samoed enabled auto-merge (squash) August 17, 2025 10:28
@Samoed Samoed merged commit cb53c56 into v2.0.0 Aug 17, 2025
9 checks passed
@Samoed Samoed deleted the integrate_search_interface branch August 17, 2025 10:29
Labels
v2 Issues and PRs related to `v2` branch