[v2] Integrate search interface #2970
Conversation
Pull Request Overview

This PR integrates a new search interface to replace the existing `DenseRetrievalExactSearch` approach. It introduces a unified `SearchInterface` protocol that supports both encoder-based and cross-encoder models for retrieval tasks, providing a more flexible and extensible architecture.
Key changes:
- Introduces `SearchProtocol` and `CrossEncoderProtocol` interfaces for standardized model interaction
- Refactors retrieval evaluation to use wrapper classes (`SearchEncoderWrapper`, `SearchCrossEncoderWrapper`)
- Consolidates model protocols into a centralized `models_protocols.py` file
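To make the list above concrete, here is a minimal sketch of what such protocols could look like. The class names come from the PR description, but the method names and signatures are assumptions for illustration, not the actual code in `models_protocols.py`:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SearchProtocol(Protocol):
    """Unified search interface: build an index over a corpus, then search it."""

    def index(self, corpus: dict[str, str]) -> None: ...

    def search(
        self, queries: dict[str, str], top_k: int
    ) -> dict[str, dict[str, float]]: ...


@runtime_checkable
class CrossEncoderProtocol(Protocol):
    """Score (query, document) text pairs jointly, one float per pair."""

    def predict(self, pairs: list[tuple[str, str]]) -> list[float]: ...
```

With `runtime_checkable` protocols, any model object with matching methods satisfies the interface via `isinstance`, so no inheritance is required.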
Reviewed Changes
Copilot reviewed 63 out of 63 changed files in this pull request and generated 4 comments.
File | Description
---|---
`mteb/models/models_protocols.py` | Defines new search and cross-encoder protocols with standardized interfaces
`mteb/models/search_wrappers.py` | Implements wrapper classes that adapt encoders and cross-encoders to the search interface
`mteb/evaluation/evaluators/retrieval_evaluator.py` | Refactors to use the new search interface instead of `DenseRetrievalExactSearch`
`mteb/models/sentence_transformer_wrapper.py` | Renames classes and separates cross-encoder functionality
Various model implementation files | Updates import paths and class names to use new wrapper naming
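As a rough illustration of what a wrapper like `SearchEncoderWrapper` does conceptually, the sketch below adapts any object with an `encode` method to the search-style interface via exact dot-product search. This is a hypothetical sketch, not the PR's actual implementation; the field and method names are assumptions:

```python
import numpy as np


class SearchEncoderWrapper:
    """Adapts anything with .encode(list[str]) -> embeddings to a search interface."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.doc_ids: list[str] = []
        self.doc_embs = None

    def index(self, corpus: dict[str, str]) -> None:
        # Embed the whole corpus once, keeping doc ids aligned with embedding rows.
        self.doc_ids = list(corpus)
        self.doc_embs = np.asarray(
            self.encoder.encode([corpus[d] for d in self.doc_ids])
        )

    def search(
        self, queries: dict[str, str], top_k: int
    ) -> dict[str, dict[str, float]]:
        # Exact dense search: dot-product scores between queries and all docs.
        q_embs = np.asarray(self.encoder.encode(list(queries.values())))
        scores = q_embs @ self.doc_embs.T
        results = {}
        for qi, qid in enumerate(queries):
            top = np.argsort(-scores[qi])[:top_k]
            results[qid] = {self.doc_ids[di]: float(scores[qi, di]) for di in top}
        return results
```

The point of the wrapper is that the evaluator only ever sees `index`/`search`, so encoders, cross-encoders, and index-backed models become interchangeable.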
I've implemented a search wrapper for the cross-encoder that supports only reranking tasks. Regarding mteb/evaluation/evaluators/dense_retrieval_exact_search.py (lines 460 to 468 in 64478e7): I think we can change it to encode all pairs from qrels instead. For now, I don't think we want to support two-stage encoding; we can add it later.
Also, I'm not sure if we need mteb/evaluation/evaluators/dense_retrieval_exact_search.py (lines 279 to 283 in 64478e7). I've looked at some of the datasets and they contain the same information. CC @orionw
- combine instructions with queries
- fix old format ds
I think these are related. My guess is there are some datasets in MTEB for reranking that have nearly identical qrels and top_ranked, but there are other datasets where that doesn't hold (and/or future datasets that need the flexibility).
We do want to pick only the ids they give for reranking, but it should differ per query.
For now I think they're always the same, because they were reuploaded by you. But this can be achieved by using qrels too; I think the same information is stored in qrels.
```python
if top_ranked is None:
    raise ValueError(
        "CrossEncoder search requires top_ranked documents for reranking."
    )
```
I've implemented cross-encoders to support reranking tasks only, but we have tests:
- that test cross-encoders on retrieval
- that test two-stage reranking

Do we want to support these cases?
+1 to what @KennethEnevoldsen thinks, but second-stage reranking is nearly identical to normal reranking and almost comes for free. I would vote to keep it if there's not much added complexity (unless things have changed and now there's more complexity).
@KennethEnevoldsen What do you think?
Not always. Maybe the easiest example is that you don't need to include non-relevant documents in qrels. So if I were to upload a new reranking dataset with only positive qrels then we'd only ever get 100% (since we're only reranking the positives with no negatives). I think the old format used to always specify that they were either non-relevant or relevant, and so when I uploaded them those went to the qrels even though they wouldn't usually (but for equal scores with the old format they had to be). But you don't have to specify the non-relevant documents for most retrieval/reranking datasets, so forcing it to be that way would limit future datasets.
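A tiny numeric illustration of this point, using made-up data: if the candidate pool is derived from qrels and qrels lists only relevant documents, any ordering scores perfectly, whereas a separate top_ranked pool with unjudged negatives keeps the metric informative.

```python
def precision(pool, judgments):
    """Fraction of the candidate pool that is judged relevant."""
    return sum(judgments.get(d, 0) > 0 for d in pool) / len(pool)


# Only positives are judged, and the pool is derived from qrels:
qrels = {"q1": {"d1": 1, "d2": 1}}
pool_from_qrels = list(qrels["q1"])
# Any permutation of this pool is "perfect" - there are no negatives to misrank.

# A separate top_ranked pool keeps unjudged negatives in play:
top_ranked = {"q1": ["d1", "d2", "d3", "d4"]}  # d3/d4 are unjudged negatives
```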
We should at least have it in v2. However, we might limit the scope of this PR and just create an issue on it.
I am thinking something like this:
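(The original snippet from this comment is not preserved in the thread. As a hypothetical sketch of the idea being described - a separate component that composes a first-stage searcher with a reranker, keeping the evaluation loop itself simple - with all names invented for illustration:)

```python
class TwoStageSearch:
    """Hypothetical sketch: first stage retrieves candidates, a reranker rescores them."""

    def __init__(self, first_stage, reranker, first_stage_k: int = 100):
        self.first_stage = first_stage  # assumed to implement index()/search()
        self.reranker = reranker        # assumed to implement predict(pairs) -> scores
        self.first_stage_k = first_stage_k
        self.corpus: dict[str, str] = {}

    def index(self, corpus: dict[str, str]) -> None:
        self.corpus = corpus
        self.first_stage.index(corpus)

    def search(
        self, queries: dict[str, str], top_k: int
    ) -> dict[str, dict[str, float]]:
        # Stage 1: retrieve a candidate pool per query.
        candidates = self.first_stage.search(queries, self.first_stage_k)
        results = {}
        for qid, docs in candidates.items():
            # Stage 2: rescore each (query, document) pair with the reranker.
            pairs = [(queries[qid], self.corpus[did]) for did in docs]
            scores = self.reranker.predict(pairs)
            reranked = sorted(zip(docs, scores), key=lambda x: -x[1])[:top_k]
            results[qid] = {did: float(s) for did, s in reranked}
        return results
```

Because the composite exposes the same `index`/`search` surface as a single-stage model, the evaluator never needs to know a second stage exists.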
The hope is that this would maintain a fairly simple evaluation loop and move the two-stage logic into a different component. WDYT? Note: this does not support saving the errors (which is also not supported currently).

Regarding the qrels vs. top_ranked question: I am not entirely sure what the right move is, but I would probably trust @orionw's intuition more than my own here, in which case I would keep them. However, if there are no cases where they are different, then I am not sure it is needed. Is it used to compute certain metrics?
It was only used to take pairs of query id and doc id for reranking tasks for the encoders. Also, I'm not sure whether cross-encoders should support retrieval tasks or not. If we make them use qrels, then they can be used with both reranking and retrieval; if only top_ranked, then reranking tasks only (as implemented now).
I'm not sure about wrappers for the models, because this problem is more related to the task than to the models, and the results of the previous run should be passed to the evaluator somehow. Also, we might want to find a solution for how to save the results of these runs, e.g. to show them on the leaderboard.
I think they should. We want people to evaluate realistic use-cases. I would argue that it is more realistic than the reranking case. (@orionw)
Hmm, I don't think it is a concern for the evaluator. The model can respond with the same documents as the previous model plus its own edits. There is no reason for the evaluator to know what the previous model predicted.
So cross-encoders would use only qrels, or we can make them use qrels by default and top_ranked if it is passed.
But in that case the evaluator will pass pairs of queries and documents (as in standard retrieval) to the model and will expect their scores, and I don't understand how the model would use top_ranked then.
There are cases where they are different.

The somewhat long background is that for MTEB v1, the reranking tasks used custom metrics and contained these fields:
In #1359 we consolidated retrieval/reranking so they could both use the TREC evaluator, and now they share much of the same logic. So in order to maintain the same scores as MTEB v1 reranking, we had to add all the negatives to the negative qrels. But that is not how reranking is normally done, so it's a bit of a hack. More recent and future datasets won't do it this way. Any reranking task formulation has to have both qrels and top_ranked.
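The distinction between the two formats can be sketched with illustrative (made-up) data; the field shapes below are assumptions about the dataset layout being discussed, not copied from any actual MTEB task:

```python
# Old consolidated form: negatives were folded into qrels with score 0.
qrels_with_negatives = {"q1": {"d_pos": 1, "d_neg": 0}}

# Preferred form: qrels hold only relevance judgments, while top_ranked
# holds the candidate pool to rerank (positives and unjudged negatives alike).
qrels = {"q1": {"d_pos": 1}}
top_ranked = {"q1": ["d_pos", "d_neg", "d_other"]}

# The pool can contain documents absent from qrels, so the two fields
# are not redundant in general.
unjudged = [d for d in top_ranked["q1"] if d not in qrels["q1"]]
```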
Sorry, I don't think I fully follow this @Samoed. Cross-encoders shouldn't generally do retrieval without top_ranked.
I do think we will have to come up with a solution for leaderboard-style evaluation, but I think the wrapper and current solution work great for practical evaluation/use. I think the wrapper style would be quite easy to use @KennethEnevoldsen!
I agree with you, but in our tests one of the failures in this branch is that cross-encoders are being tested on retrieval tasks (the other cases are two-stage reranking, but still on retrieval), and in the main branch we have handling of cross-encoders in retrieval.
I think we could create a wrapper for tasks to store information about the first-stage results.
Ah I see. Yeah, I think we could drop that test, and honestly we should print a warning to the user if they are doing a retrieval task with a cross-encoder. It's possible they mean to (and we could allow it for people who really want to or are using small datasets), but more likely they are making a mistake.
Yeah, this is a nice approach! It would basically be like creating a mini-reranking task that is then reproducible. The only issue I could see is that any leaderboard would have to use the same first-stage results.
So, @orionw can you review my implementation then? I will work on multistage in another PR.
Yes, sorry, will review. Is it this one?
Yes, it is.
LGTM, thanks for the great PR!
This looks like it maintains all functionality w.r.t. top_ranked and adds the index option so we can add more late-interaction models and larger-scale dense datasets.
cc as an FYI for @NohTow, who can hopefully add all the PyLate functionality soon!
@KennethEnevoldsen can you review the PR and give your thoughts about two-stage reranking?
Yeah, I generally think this looks really good - very happy about where we are heading. Let us do the two-stage in another PR. I am more than happy with disabling the test here and merging.
Closes #2728
Added this interface instead of `DenseRetrievalExactSearch`.