
Conversation

Samoed
Member

@Samoed Samoed commented Aug 17, 2025

I've created two-stage reranking. Here is an example of how to run it:

import mteb
from sentence_transformers import CrossEncoder

# Stage 1: retrieve with a fast embedding model and save its predictions
model = mteb.get_model("minishlab/potion-base-2M")
task = mteb.get_task("NanoArguAnaRetrieval")
prediction_folder = "results_folder"

res = mteb.evaluate(
    model,
    task,
    prediction_folder=prediction_folder,
    # overwrite_strategy="always",
)

# Stage 2: rerank the saved top-100 candidates with a cross-encoder
task = task.convert_to_reranking(task.predictions_path(prediction_folder), top_k=100)
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
mteb.evaluate(
    model,
    task,
)

I've also added saving information about the run to the result file for better reproducibility.

@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Aug 18, 2025
@Samoed Samoed marked this pull request as ready for review August 18, 2025 07:45
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Hmm thanks for making the PR @Samoed.

I really don't like the save_retrieval_results=True argument, as it is clearly a task-specific argument.

I do however like that it is possible to create a reranking from a retrieval task. How about

import mteb

model = mteb.get_model("minishlab/potion-base-2M")
retrieval_task = mteb.get_task("NanoArguAnaRetrieval")

model = SaveRetrievalPredictionsWrapper(model)

res = mteb.evaluate(model, retrieval_task)
reranking_task = retrieval_task.convert_to_reranking(model.predictions, top_k=100)

model = mteb.get_model("jinaai/jina-reranker-v2-base-multilingual")
mteb.evaluate(model, reranking_task)

Alternatively, we could add a more general save_predictions flag that can be used for all tasks, raising NotImplementedError() for all tasks but retrieval. I could see that as an option as well. Besides the many not-implemented errors this would cause, it would also require handling new arguments, such as where to save predictions, etc.

The wrapper isolates the concern and leaves the evaluator focused only on evaluation.

@@ -187,12 +187,3 @@ def test_create_meta_from_existing(
command = f"{sys.executable} -m mteb create_meta --results_folder {results.as_posix()} --output_path {output_path.as_posix()} --from_existing {existing_readme.as_posix()} --overwrite"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
assert result.returncode == 0, "Command failed"


def test_save_predictions():
Contributor


isn't this still used by MTEB()?

Member Author


Yes, this is still used by MTEB, but I don't want to support it, because otherwise it would require supporting a lot of extra parameters in AbsRetrieval, which I think we don't want.

Contributor


So this will break MTEB()?

Member Author

@Samoed Samoed Aug 22, 2025


I've checked a bit further. It will only partially break the CLI, because we don't support this flag for now, but we need to refactor the CLI a bit to use mteb.evaluate instead of MTEB. We can add support for this flag to keep backward compatibility. But this flag is not supported by MTEB directly; it's just passed through kwargs and not documented.

def run(
    self,
    model: MTEBModels | CrossEncoder | SentenceTransformer,
    verbosity: int = 1,
    output_folder: str | None = "results",
    eval_splits: list[str] | None = None,
    eval_subsets: list[str] | None = None,
    overwrite_results: bool = False,
    raise_error: bool = True,
    co2_tracker: bool = True,
    encode_kwargs: dict[str, Any] | None = None,
    **kwargs,
) -> list[TaskResult]:

Contributor


Yeah indeed!

Contributor

@KennethEnevoldsen KennethEnevoldsen Aug 22, 2025


Let us keep it like this, but remove it from the CLI args/docs.

Member Author

@Samoed Samoed Aug 22, 2025


If we implement an additional method for saving results, I think we can leave it as is.

@Samoed
Member Author

Samoed commented Aug 19, 2025

I really don't like the save_retrieval_results=True argument, as it is clearly a task-specific argument.

I definitely agree, but I don't see another solution to the problem, which is why I didn't want to implement it this way at first.

model = SaveRetrievalPredictionsWrapper(model)

I don't think this would solve the problem, because models can only save embeddings, not predictions for the retrieval task. I don't know how to reuse results from the wrapper then.

Alternatively, we could add a more general save_predictions flag, that can be used for all tasks and just add a NotImplementedError() for all tasks but retrieval

I can make an implementation for it, because this would solve the problem for all tasks.

@KennethEnevoldsen
Contributor

I don't think this would solve the problem, because models can only save embeddings, not predictions for the retrieval task. I don't know how to reuse results from the wrapper then.

Why can't they save predictions? Doesn't the search interface return RetrievalOutputType?

I can make an implementation for it, because this would solve the problem for all tasks.

Let us decide on a good solution before spending the implementation time.

@Samoed
Member Author

Samoed commented Aug 20, 2025

Why can't they save predictions? Doesn't the search interface return RetrievalOutputType?

Oh, I forgot that we have the search interface. Yes, the wrapper should work then.
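A minimal sketch of such a wrapper, assuming the wrapped model exposes a `search(corpus, queries, top_k)` method that returns a query-to-document score mapping (the class name, constructor parameters, and file format here are illustrative, not mteb's actual API):

```python
import json
from pathlib import Path


class SaveRetrievalPredictionsWrapper:
    """Hypothetical sketch, not mteb's actual API: wrap a model that
    exposes `search(...)` and record the query_id -> {doc_id: score}
    mapping it returns, so a reranking task can later be built from it."""

    def __init__(self, model, results_path):
        self._model = model
        self.results_path = Path(results_path)
        self.predictions = {}

    def search(self, corpus, queries, top_k, **kwargs):
        results = self._model.search(corpus, queries, top_k, **kwargs)
        self.predictions.update(results)
        # Persist after every call so partial runs are still reusable.
        self.results_path.write_text(json.dumps(self.predictions))
        return results

    def __getattr__(self, name):
        # Delegate everything else (e.g. encode) to the wrapped model.
        return getattr(self._model, name)
```

The evaluator would then call `search` as usual, and `model.predictions` (or the JSON file) could feed `convert_to_reranking`.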

Contributor

@orionw orionw left a comment


Sorry for the late review. In general this looks fine to me.

As a meta-comment though, I will say I'm nervous we're adding too many wrappers. Do they stack? Can we save predictions and cache embeddings and ...?

Maybe a personal preference, but I think arguments to functions are a much nicer UI from a user's perspective than doing a bunch of wrappers. However, we have so many it's probably too late to change all of these at this point. Just worth noting for future extensions.

@Samoed
Member Author

Samoed commented Aug 20, 2025

Can we save predictions and cache embeddings and ...?

I thought about this, too. This would look like RetrievalSaveResultsWrapper(CacheEmbeddingWrapper(model)) and should work.
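Stacking should work as long as each wrapper forwards unknown attributes to the object it wraps. A toy sketch of that delegation pattern (the class names mirror the discussion but are illustrative stand-ins, not mteb's actual wrappers):

```python
class DelegatingWrapper:
    """Toy base class: forward anything not overridden to the wrapped
    object, which is what lets wrappers nest in any order."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        return getattr(self._inner, name)


class CacheEmbeddingWrapper(DelegatingWrapper):
    """Stand-in: would cache encode() calls (caching elided here)."""

    def encode(self, texts, **kwargs):
        return self._inner.encode(texts, **kwargs)


class RetrievalSaveResultsWrapper(DelegatingWrapper):
    """Stand-in: records search() outputs from the inner model."""

    def search(self, corpus, queries, **kwargs):
        results = self._inner.search(corpus, queries, **kwargs)
        self.predictions = results  # keep for later reuse
        return results
```

With this pattern, `RetrievalSaveResultsWrapper(CacheEmbeddingWrapper(model))` exposes both behaviors: `search` is intercepted by the outer wrapper, while `encode` falls through to the cache layer.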

Maybe a personal preference but I think arguments to functions are a much nicer UI from a users perspective than doing a bunch of wrappers. However, we have so many it's probably too late to change all of these at this point. Just worth noting for future extensions.

In the new mteb.evaluate we have just five parameters:

mteb/mteb/evaluate.py

Lines 162 to 171 in cf01bc8

def evaluate(
    model: ModelMeta | MTEBModels | SentenceTransformer | CrossEncoder,
    tasks: AbsTask | Iterable[AbsTask],
    *,
    co2_tracker: bool | None = None,
    raise_error: bool = True,
    encode_kwargs: dict[str, Any] | None = None,
    cache: ResultCache | None = ResultCache(),
    overwrite_strategy: str | OverwriteStrategy = "only-missing",
) -> ModelResult:

def __init__(
    self,
    model: SearchProtocol,
    results_path: str | Path,
Contributor


So based on @orionw's and @Samoed's comments, it might be better to add the following arguments:

save_retrieval_predictions: bool
prediction_save_path: Path

And then in the future we can raise a DeprecationWarning on save_retrieval_predictions and instead refer to save_predictions.

let me know what you think?

(sorry @Samoed i understand that this is a digression from before)

Contributor


We might still want to keep the wrapper internally (without exposing it to the user)

Member Author


And then in the future we can raise a DeprecationWarning on save_retrieval_predictions and instead refer to save_predictions.

Where do you want to add it? To mteb.evaluate?

Contributor


Yea that would be the idea

Member Author

@Samoed Samoed Aug 22, 2025


Should we add a save_predictions method to all tasks then? Then we could use only prediction_save_path: Path.
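A hypothetical sketch of what such a per-task hook might look like (class names and the JSON format are assumptions for illustration, not mteb's actual code): the base task raises NotImplementedError until a task type adds support, which matches the flag behavior discussed above.

```python
import json
from pathlib import Path


class AbsTaskSketch:
    """Hypothetical base-class hook, not mteb's actual API."""

    def save_predictions(self, predictions, prediction_save_path: Path) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support saving predictions yet."
        )


class RetrievalTaskSketch(AbsTaskSketch):
    """Retrieval would be the first task type to implement the hook."""

    def save_predictions(self, predictions, prediction_save_path: Path) -> None:
        # predictions: query_id -> {doc_id: score}
        prediction_save_path.write_text(json.dumps(predictions))
```

Other task types would override `save_predictions` later, as discussed below for this PR (retrieval first, the rest afterwards).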

Contributor


Hmm, but wouldn't that require a lot of implementation across the different tasks?

Member Author


Yes, but I don't like that we're passing task-specific args. Also, this can be helpful for some users.

Contributor


Sure, it was simply that we don't have it implemented yet so I am afraid it would give a poor user experience

Member Author


I will implement it for retrieval in this PR and will add it to other tasks later.

Contributor


Alright let us do it this way (sorry for backtracking)

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Can you redo the example at the top? I think this format looks pretty good, though. A few updates on naming things.

Samoed and others added 2 commits August 26, 2025 13:20
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
@Samoed
Member Author

Samoed commented Aug 26, 2025

Updated example in description

@KennethEnevoldsen
Contributor

Looks great @Samoed - I think this is good to merge - we need to update some docs away from using MTEB() for two-stage retrieval, but I can do that in a separate PR

@Samoed Samoed enabled auto-merge (squash) August 26, 2025 09:26
@Samoed Samoed merged commit 687cb78 into v2.0.0 Aug 26, 2025
13 of 14 checks passed
@Samoed Samoed deleted the two_stage_rerank branch August 26, 2025 13:28