
Conversation

Samoed
Member

@Samoed Samoed commented Jul 6, 2025

Fixes #2692

I've updated the retrieval split as follows:

class RetrievalSplitData(TypedDict):
-    corpus: dict[str, dict[str, str]]
-    queries: dict[str, str]
+    corpus: Dataset
+    queries: Dataset
    relevant_docs: Mapping[str, Mapping[str, float]]
-    instructions: dict[str, str] | None
+    instructions: Dataset | None
    top_ranked: Mapping[str, list[str]] | None

I didn't change relevant_docs and top_ranked to a dataset because we're accessing these elements by ID, and using a dict is simpler for that. However, I think we can change it if needed.
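For illustration, here is a minimal sketch of why the dict form is convenient for ID lookups (the variable names and field layout below are assumptions for the example, not taken from the actual loaders):

from datasets import Dataset

# Hypothetical split data, roughly matching the types above.
corpus = Dataset.from_dict({"id": ["d1", "d2"], "text": ["Doc one", "Doc two"]})
relevant_docs = {"q1": {"d2": 1.0}}  # qrels keyed by query ID

# With a plain dict, a relevance lookup is a direct key access:
scores_for_q1 = relevant_docs["q1"]  # {'d2': 1.0}

# With a Dataset, the same lookup would require a scan or filter, e.g.:
# qrels_ds.filter(lambda row: row["query_id"] == "q1")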

I also modified the test a bit because we’re creating mock embeddings independently of the input. Previously, the corpus dict was sorted by keys and, for some reason, ended up as [d2, d1], so the first embedding was used. Maybe we should generate a random embedding based on the input instead. I’ve tested the branches on SCIDOCS, and the results didn’t change.

@Samoed Samoed changed the title Retrieval ds Change corpus and queries to use dataset Jul 6, 2025
@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Jul 6, 2025
@Samoed Samoed changed the title Change corpus and queries to use dataset [v2] Change corpus and queries to use dataset Jul 6, 2025
@Samoed Samoed marked this pull request as ready for review July 25, 2025 14:05
Contributor

@orionw orionw left a comment


LGTM, thanks for the fix. I don't love that we have different types (dict and Dataset), but you're right that it doesn't seem to make sense to force them to be Datasets when we use them as dicts. And we have type annotations, so users can figure it out if they need to.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Can you check that this still leads to the same results? I see that you did that.

Maybe we should generate a random embedding based on the input instead.

Yes, I have had that problem before as well. I have actually considered replacing the mock encoder with a small model2vec model (small enough to test with, and a more valid test model).

@@ -116,11 +120,13 @@ def __init__(self, **kwargs):
def convert_v1_dataset_format_to_v2(self):
Contributor


I am unsure how our changes influence the loading of v2 retrieval datasets (do we have any of those?). It would be great to see how that would influence the load data setup.

Contributor


How does this influence pushing datasets to the hub? Does it push in a v2-compatible manner, and do we need to update now?

Member Author


Updated function for pushing datasets

Member Author

@Samoed Samoed Jul 30, 2025


I tried to measure the load time with this script:

import mteb
from datetime import datetime

# Time how long load_data takes for SCIDOCS (dataset already cached locally).
task = mteb.get_task("SCIDOCS")
start = datetime.now()
task.load_data()
print((datetime.now() - start).seconds)

For v2 I got times of 10-30 seconds, and for this PR 7-20 seconds. I don't understand why there is such a huge difference between identical runs on the same branch (the dataset was preloaded locally).

Comment on lines 354 to 356
results = sorted(
results["1"].keys(), key=lambda x: (results["1"][x], x), reverse=True
)[:2]
Contributor


reverse=True. Isn't that a bit worrying?

Member Author


I tried to debug this part and found that we’re handling the reranking task a bit incorrectly, because it passes a tuple[dict] to corpus_to_dict, and this ends up storing all information in the text field. This might be part of the issue discussed in #2933.

After making this change, I looked at the results, and they contain values like:

{'18670': -10.466630935668945, '4983': -8.812150955200195, '19238': -11.240396499633789}

So we need to sort them in reverse.
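As a quick illustration with the values above, sorting descending returns the least negative (i.e. highest-scoring) document IDs first:

# Cross-encoder style scores: higher (less negative) means more relevant.
results = {
    "1": {"18670": -10.466630935668945, "4983": -8.812150955200195, "19238": -11.240396499633789}
}

top2 = sorted(
    results["1"].keys(), key=lambda x: (results["1"][x], x), reverse=True
)[:2]
print(top2)  # ['4983', '18670']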

Contributor


Hmm, interesting. We should maybe test for the exact values here then, to ensure that we don't change this behaviour going forward.

We of course also have to test #2933.

Member Author


Yes, I think we should create mock cross-encoders to test them more reproducibly.
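For example, a minimal sketch of such a mock (the class name and predict interface are assumptions for illustration, not existing test utilities): scores are derived deterministically from the query/document text, so reruns always produce the same ranking.

import hashlib

class MockCrossEncoder:
    """Toy reranker returning a deterministic score per (query, document) pair."""

    def predict(self, pairs: list[tuple[str, str]]) -> list[float]:
        scores = []
        for query, doc in pairs:
            digest = hashlib.md5(f"{query}||{doc}".encode()).hexdigest()
            # Map the hash to a stable pseudo-score in [0, 1).
            scores.append(int(digest, 16) % 10_000 / 10_000)
        return scores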

@Samoed
Member Author

Samoed commented Jul 25, 2025

Yes. I have had that problem before as well. I have actually considered replacing the mockencoder with a small model2vec model

We can use the approach from the langchain fake embeddings, setting the seed from the text: https://github.com/langchain-ai/langchain/blob/master/libs%2Fcore%2Flangchain_core%2Fembeddings%2Ffake.py#L120
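A minimal sketch of that idea (assuming a numpy-based mock encoder, not the existing test class): seed an RNG from a hash of the text, so the embedding is random-looking but fully deterministic per input.

import hashlib

import numpy as np

class SeededMockEncoder:
    def __init__(self, dim: int = 32):
        self.dim = dim

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        vectors = []
        for text in sentences:
            # Derive the seed from the text itself so identical inputs always
            # get identical embeddings across runs and branches.
            seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
            rng = np.random.default_rng(seed)
            vectors.append(rng.standard_normal(self.dim))
        return np.stack(vectors)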

@KennethEnevoldsen
Contributor

We can use the approach from the langchain fake embeddings, setting the seed from the text: https://github.com/langchain-ai/langchain/blob/master/libs%2Fcore%2Flangchain_core%2Fembeddings%2Ffake.py#L120

Great idea! Let us do that instead.

@Samoed Samoed requested a review from KennethEnevoldsen July 30, 2025 06:25
@Samoed
Member Author

Samoed commented Jul 31, 2025

@KennethEnevoldsen can you re-review this?

@@ -345,14 +345,17 @@ def test_mteb_rerank(tmp_path: Path):
eval_splits=["test"],
previous_results=tmp_file,
save_predictions=True,
co2_tracker=False,
Contributor


Yeah, it will be good to have it off by default in v2's evaluate.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


A few minor things, generally looks very good!

# Conflicts:
#	mteb/evaluation/evaluators/dense_retrieval_exact_search.py
@Samoed Samoed enabled auto-merge (squash) August 4, 2025 05:04
@Samoed Samoed merged commit cba95e7 into v2.0.0 Aug 4, 2025
9 checks passed
@Samoed Samoed deleted the retrieval_ds branch August 4, 2025 06:14