
Conversation

Samoed
Member

@Samoed Samoed commented Aug 2, 2025

Closes #2728

Added this interface to replace DenseRetrievalExactSearch:

from typing import Any, Protocol

# CorpusDatasetType, QueryDatasetType, InstructionDatasetType,
# TopRankedDocumentsType, RetrievalOutputType, and TaskMetadata
# are mteb's own type aliases.


class SearchInterface(Protocol):
    """Interface for searching models."""

    def index(
        self,
        corpus: CorpusDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        encode_kwargs: dict[str, Any],
    ) -> None:
        ...

    def search(
        self,
        queries: QueryDatasetType,
        *,
        task_metadata: TaskMetadata,
        hf_split: str,
        hf_subset: str,
        top_k: int,
        encode_kwargs: dict[str, Any],
        instructions: InstructionDatasetType | None = None,
        top_ranked: TopRankedDocumentsType | None = None,
    ) -> RetrievalOutputType:
        ...
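To make the shape of the protocol concrete, here is a toy brute-force searcher with the same `index`/`search` split. Everything in it is an illustrative assumption, not the PR's actual wrapper: the class name `ToyExactSearch`, the bag-of-words stand-in for an encoder, and the plain-dict corpus/query types standing in for mteb's `CorpusDatasetType`/`QueryDatasetType`.

```python
from collections import Counter


class ToyExactSearch:
    """Toy sketch matching the shape of SearchInterface (not mteb's code)."""

    def __init__(self) -> None:
        self.doc_vecs: dict[str, Counter] = {}

    def index(self, corpus: dict[str, str], **kwargs) -> None:
        # Encode and store the whole corpus up front.
        self.doc_vecs = {doc_id: self._embed(text) for doc_id, text in corpus.items()}

    def search(
        self, queries: dict[str, str], *, top_k: int, **kwargs
    ) -> dict[str, dict[str, float]]:
        # Score every document for every query and keep only the top_k.
        results: dict[str, dict[str, float]] = {}
        for qid, text in queries.items():
            q_vec = self._embed(text)
            scores = {
                doc_id: float(sum(q_vec[t] * d_vec[t] for t in q_vec))
                for doc_id, d_vec in self.doc_vecs.items()
            }
            top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            results[qid] = dict(top)
        return results

    @staticmethod
    def _embed(text: str) -> Counter:
        # Stand-in for a real encoder: a bag-of-words count vector.
        return Counter(text.lower().split())
```

The point of the two-step shape is that `index` can do the expensive corpus pass once, while repeated `search` calls only encode queries.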

@Samoed Samoed requested a review from Copilot August 5, 2025 06:16

@Copilot Copilot AI left a comment


Pull Request Overview

This PR integrates a new search interface to replace the existing DenseRetrievalExactSearch approach. It introduces a unified SearchInterface protocol that supports both encoder-based and cross-encoder models for retrieval tasks, providing a more flexible and extensible architecture.

Key changes:

  • Introduces SearchProtocol and CrossEncoderProtocol interfaces for standardized model interaction
  • Refactors retrieval evaluation to use wrapper classes (SearchEncoderWrapper, SearchCrossEncoderWrapper)
  • Consolidates model protocols into a centralized models_protocols.py file

Reviewed Changes

Copilot reviewed 63 out of 63 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| mteb/models/models_protocols.py | Defines new search and cross-encoder protocols with standardized interfaces |
| mteb/models/search_wrappers.py | Implements wrapper classes that adapt encoders and cross-encoders to the search interface |
| mteb/evaluation/evaluators/retrieval_evaluator.py | Refactors to use the new search interface instead of DenseRetrievalExactSearch |
| mteb/models/sentence_transformer_wrapper.py | Renames classes and separates cross-encoder functionality |
| Various model implementation files | Updates import paths and class names to use new wrapper naming |

@Samoed
Member Author

Samoed commented Aug 5, 2025

I've implemented a search wrapper for the cross-encoder to support only reranking tasks. I think the v2 implementation of cross-encoder search is a bit incorrect. In this part, we're selecting all corpus IDs, and I believe it always selects the same IDs every time:

    q_results = dict.fromkeys(corpus["id"], 0.0)
else:
    q_results = self.previous_results[qid]
# take the top-k only
q_results_sorted = dict(
    sorted(q_results.items(), key=lambda item: item[1], reverse=True)
)
top_n = [k for k, v in list(q_results_sorted.items())[:top_k]]
query = row["text"]

I think we can change this to encode all pairs from qrels instead.
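The qrels-based alternative could look roughly like this sketch (the dict layouts for `qrels`, `queries`, and `corpus` below are assumptions for illustration, not mteb's actual dataset types):

```python
# Build cross-encoder input pairs directly from the judged (query, document)
# pairs in qrels, instead of scoring the whole corpus for every query.
def pairs_from_qrels(
    qrels: dict[str, dict[str, int]],
    queries: dict[str, str],
    corpus: dict[str, str],
) -> list[tuple[str, str]]:
    """Return (query_text, doc_text) pairs for every judged pair."""
    return [
        (queries[qid], corpus[doc_id])
        for qid, judged in qrels.items()
        for doc_id in judged
    ]
```

The cross-encoder would then score only these pairs, so the cost scales with the number of judgments rather than with |queries| x |corpus|.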

For now, I think we don't want to support two-stage encoding. We can add it later.

@Samoed
Member Author

Samoed commented Aug 5, 2025

Also, I'm not sure we need the top_ranked split if it has the same logic and values as the qrels (relevant_document) split. It's used only to get the corpus IDs for a query, but I think this can be done with relevant_document too.

if query_id not in top_ranked:
    logger.warning(f"No pre-ranked documents found for query {query_id}")
    continue
ranked_ids = top_ranked[query_id]

I've looked at some datasets, and they have the same information in qrels and top_ranked.

CC @orionw

@orionw
Contributor

orionw commented Aug 7, 2025

Also, I'm not sure we need the top_ranked split if it has the same logic and values as the qrels (relevant_document) split. It's used only to get the corpus IDs for a query, but I think this can be done with relevant_document too.

For now, I think we don't want to support two-stage encoding. We can add it later.

I think these are related. top_ranked is simulating the 2nd-stage pass, where your first stage has found some documents but you aren't sure whether they're relevant or not. If you just used the qrels, then for some datasets you'd be limited to reranking documents that have been judged (which in some cases is just the relevant docs).

My guess is there are some datasets in MTEB for reranking that have nearly identical qrels and top_ranked, but there are other datasets where that doesn't hold (and/or future datasets/flexibility).

I've implemented a search wrapper for the cross-encoder to support only reranking tasks. I think the v2 implementation of cross-encoder search is a bit incorrect. In this part, we're selecting all corpus IDs, and I believe it always selects the same IDs every time:

We do want to only pick the ids they give for reranking, but it should differ per query. I think top_ranked is a mapping for this reason, to select the ids to use for each query.

@Samoed
Member Author

Samoed commented Aug 7, 2025

My guess is there are some datasets in MTEB for reranking that have nearly identical qrels and top_ranked, but there are other datasets where that doesn't hold (and/or future datasets/flexibility).

For now, I think they're always the same, because they were reuploaded by you.

top_ranked is simulating the 2nd stage pass where your first stage has found some documents but you aren't sure if they're relevant or not.

But this can be achieved by using qrels too.

We do want to only pick the ids they give for reranking, but it should differ per query. I think top_ranked is a mapping for this reason, to select the ids to use for each query.

I think the same information is stored in qrels.

Comment on lines +302 to +305
if top_ranked is None:
    raise ValueError(
        "CrossEncoder search requires top_ranked documents for reranking."
    )
Member Author


I've implemented cross-encoders to support reranking tasks only, but we have tests:

  1. testing cross-encoders on retrieval
  2. two-stage reranking

Do we want to support these cases?

Contributor


+1 to what @KennethEnevoldsen thinks, but the 2nd-stage reranking is nearly identical to just normal reranking and almost comes for free. I would vote to keep it if there's not much added complexity (unless things have changed and now there's more complexity).

@Samoed
Member Author

Samoed commented Aug 7, 2025

@KennethEnevoldsen What do you think about relevant_docs/top_ranked?

@orionw
Contributor

orionw commented Aug 7, 2025

But this can be achieved by using qrels too.

Not always. Maybe the easiest example is that you don't need to include non-relevant documents in qrels. So if I were to upload a new reranking dataset with only positive qrels then we'd only ever get 100% (since we're only reranking the positives with no negatives).

I think the old format used to always specify whether documents were relevant or non-relevant, and so when I uploaded them, those went into the qrels even though they normally wouldn't (but to keep scores equal with the old format they had to be).

But you don't have to specify the non-relevant documents for most retrieval/reranking datasets so forcing it to be that way would limit future datasets.

@KennethEnevoldsen
Contributor

For now, I think we don't want to support two-stage encoding. We can add it later.

We should at least have it in v2. However, we might limit the scope of this PR and just create an issue on it.

mteb.evaluate currently doesn't support it, but it should be possible to solve this using a wrapper.

I am thinking something like this:

retrieval_task = ...
model_stage1 = ...
model_stage2 = ...

model_stage1 = SavePredictionsWrapper(model_stage1, path=...)
res = mteb.evaluate(model_stage1, retrieval_task)

model_stage2 = LoadPredictions(model_stage2, path=...)
res = mteb.evaluate(model_stage2, retrieval_task)

The hope is that this would maintain a fairly simple evaluation loop and move the two-stage into a different component. WDYT?

Note: this does not support saving the errors (which is also not supported in mteb.evaluate, but that is also currently not documented anywhere).
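A minimal sketch of what such wrappers could look like. Neither `SavePredictionsWrapper` nor `LoadPredictions` exists in mteb yet; the JSON hand-off file and the `search`/`top_ranked` signatures below are assumptions for illustration only.

```python
import json
from pathlib import Path


class SavePredictionsWrapper:
    """Run stage 1 normally, persisting its results for a later stage."""

    def __init__(self, model, path: str) -> None:
        self.model, self.path = model, path

    def search(self, queries, **kwargs):
        results = self.model.search(queries, **kwargs)
        Path(self.path).write_text(json.dumps(results))  # persist the stage-1 run
        return results


class LoadPredictions:
    """Feed a saved stage-1 run to the second model as reranking candidates."""

    def __init__(self, model, path: str) -> None:
        self.model, self.path = model, path

    def search(self, queries, **kwargs):
        previous = json.loads(Path(self.path).read_text())
        top_ranked = {qid: list(docs) for qid, docs in previous.items()}
        return self.model.search(queries, top_ranked=top_ranked, **kwargs)
```

This keeps the evaluation loop unchanged: each stage is just another model from `mteb.evaluate`'s point of view.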

Regarding qrels and top_ranked:

I am not entirely sure what the right move is, but I would probably trust @orionw's intuition more than my own here. In which case I would keep them.

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

@Samoed
Member Author

Samoed commented Aug 9, 2025

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

It was only used to get (query id, doc id) pairs for reranking tasks for the encoders.

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

The hope is that this would maintain a fairly simple evaluation loop and move the two-stage into a different component. WDYT?

I'm not sure about wrappers for the models, because this problem is more related to the task than to the models, and the results of the previous run should be passed to the evaluator somehow.

Also, we might want to find a solution for saving the results of these runs, e.g. to show them on the leaderboard.

@KennethEnevoldsen
Contributor

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

I think they should. We want people to evaluate realistic use-cases. I would argue that it is more realistic than the reranking case.

(@orionw)

I'm not sure about wrappers for the models, because this problem is more related to the task than to the models, and the results of the previous run should be passed to the evaluator somehow.

Hmm I don't think it is a concern for the evaluator. The model can respond with the same documents as the previous model + its own edits. No reason for the evaluator to know what the previous model predicted.

@Samoed
Member Author

Samoed commented Aug 9, 2025

I think they should

So cross-encoders would use only qrels, or we can make them use qrels by default and top_ranked if it is passed.

Hmm I don't think it is a concern for the evaluator.

But in that case the evaluator will pass pairs of queries and documents (as in standard retrieval) to the model and will expect their scores, and I don't understand how the model would use top_ranked then.

@orionw
Contributor

orionw commented Aug 9, 2025

However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?

There are cases where they are different - at least all of the InstructionReranking tasks.

The somewhat long background is that for MTEB v1, the reranking tasks used to use custom metrics and contain these fields:

        query: str
        positive: list[str]
        negative: list[str]

In #1359 we consolidated retrieval/reranking so they could both use the TREC evaluator and now they share much of the same logic.

So in order to maintain the same scores as MTEB v1 reranking, we had to add all the negatives to the negative qrels. But that is not how reranking is normally done, so it's a bit of a hack. More recent and future datasets won't do it this way.

Any reranking task formulation has to have both qrels and top_ranked since they are generally quite distinct. Imagine I wanted to create a new NQ reranking task (NQ has two positives per query): NQ doesn't have qrels for documents that are non-relevant since humans only validated the positives. If I didn't upload a top_ranked, it would only rerank from the qrels, which would only rerank the two relevant documents per query. This would always achieve 100% on all metrics. The top_ranked field instead tells the model what to rerank and the qrels is what has been judged to be relevant or not.
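The NQ-style situation above can be shown with a few lines of fabricated toy data (the IDs and splits below are invented for illustration):

```python
# With qrels that contain only validated positives, reranking "from qrels"
# is degenerate; top_ranked supplies the actual first-stage candidate pool.
qrels = {"q1": {"d1": 1, "d2": 1}}             # only judged positives
top_ranked = {"q1": ["d1", "d2", "d7", "d9"]}  # positives + unjudged negatives

candidates_from_qrels = list(qrels["q1"])        # reranks only the 2 positives
candidates_from_top_ranked = top_ranked["q1"]    # a real reranking pool

# Any ordering of candidates_from_qrels scores 100% against qrels by
# construction; only candidates_from_top_ranked actually tests the model.
```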

Also, I'm not sure whether cross-encoders should support retrieval tasks. If we make them use qrels, they can be used for both reranking and retrieval. If they use only top_ranked, then reranking tasks only (as implemented now).

Sorry, I don't think I fully follow this @Samoed. Cross-Encoders shouldn't generally do retrieval without top_ranked since they would have to rerank every combination of query and doc (very expensive), although it is technically feasible. But Embedding models could do either one and should already be able to handle both.

Also we might want to find a solution how to save results of these run, e.g. to show them on leaderboard

I do think we will have to come up with a solution for leaderboard style evaluation, but I think the wrapper and current solution works great for practical evaluation/use.

I think the wrapper style would be quite easy to use @KennethEnevoldsen!

@Samoed
Member Author

Samoed commented Aug 9, 2025

Cross-Encoders shouldn't generally do retrieval without top_ranked since they would have to rerank every combination of query and doc (very expensive), although it is technically feasible.

I agree with you, but in our tests, one of the failures in this branch is that cross-encoders are being tested on retrieval tasks (the other cases are two-stage reranking, but still on retrieval), and in the main branch we have handling of cross-encoders in retrieval (`def search_cross_encoder`), and that's why I raised this question.

I do think we will have to come up with a solution for leaderboard style evaluation, but I think the wrapper and current solution works great for practical evaluation/use.

I think we could create a wrapper for tasks to store information about top_ranked so it can be reused in the evaluator instead of in the model.

@orionw
Contributor

orionw commented Aug 9, 2025

I agree with you, but in our tests, one of the failures in this branch is that cross-encoders are being tested on retrieval tasks (the other cases are two-stage reranking, but still on retrieval), and that's why I raised this question.

Ah I see. Yeah I think we could drop that test and honestly we should print a warning to the user if they are doing a retrieval task with a cross-encoder. It's possible they mean to (and we could allow it for people who really want to or are using small datasets), but more likely they are making a mistake.

I think we could create a wrapper for tasks to store information about top_ranked so it can be reused in the evaluator instead of in the model.

Yeah, this is a nice approach! It would basically be like creating a mini-reranking task that is then reproducible. The only issue I could see is that any leaderboard would have to use the same top_ranked, but it would solve reproducibility issues, which is big.

@Samoed
Member Author

Samoed commented Aug 9, 2025

So, @orionw, can you review my implementation then? I will work on multistage in another PR.

@orionw
Contributor

orionw commented Aug 9, 2025

Yes, sorry, will review. Is it this one?

@Samoed
Member Author

Samoed commented Aug 9, 2025

Yes, it is

Contributor

@orionw orionw left a comment


LGTM, thanks for the great PR!

This looks like it maintains all functionality w.r.t. top_ranked and adds the index option so we can add more late interaction models and larger-scale dense datasets.

cc as an FYI for @NohTow who can hopefully add all the PyLate functionality soon!

@Samoed
Member Author

Samoed commented Aug 16, 2025

@KennethEnevoldsen can you review the PR and give your thoughts about two-stage reranking?

@KennethEnevoldsen
Contributor

Yeah, I generally think that this looks really good - very happy about where we are heading.

Let us do the two-stage in another PR.

I am more than happy with disabling the test here and merging

@Samoed Samoed enabled auto-merge (squash) August 17, 2025 10:28
@Samoed Samoed merged commit cb53c56 into v2.0.0 Aug 17, 2025
9 checks passed
@Samoed Samoed deleted the integrate_search_interface branch August 17, 2025 10:29
Labels
v2 Issues and PRs related to `v2` branch