[v2] Combine instructions with queries #2984

Samoed · 2025-08-04T05:59:17Z

I've merged the instruction and queries datasets. A row in the queries dataset now looks like this:

{"id": "123", "text": "some text", "instruction": "instruction"}
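
For readers who want to see the row format above in code, here is a minimal sketch of how a separate instructions mapping could be folded into query rows. It is not the code from this PR: the function name `merge_instructions_into_queries`, the plain-dict inputs, and the choice of `None` for a query without an instruction are all assumptions made for illustration.

```python
from __future__ import annotations


def merge_instructions_into_queries(
    queries: dict[str, str],
    instructions: dict[str, str],
) -> list[dict[str, str | None]]:
    """Build one row per query with its instruction attached.

    Hypothetical helper: `queries` maps query id -> query text and
    `instructions` maps query id -> instruction text.
    """
    rows = []
    for query_id, text in queries.items():
        rows.append(
            {
                "id": query_id,
                "text": text,
                # Assumption: a query without an instruction gets None.
                "instruction": instructions.get(query_id),
            }
        )
    return rows


merged_rows = merge_instructions_into_queries(
    {"123": "some text"},
    {"123": "instruction"},
)
print(merged_rows)
```

With the inputs shown, the single resulting row has the same three keys as the example row above.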

For now, 3 tests are failing because of the change in SearchInterface. I ran mFollowIR and Core17InstructionRetrieval, and they give the same results as on the v2 branch.

orionw

LGTM, assuming that the eval still runs and looks the same.

This must create rows that have duplicate queries (so the query count for InstructIR will grow 10x, with one row for each combination of instruction + query instead of just the queries). But that's fine with me since they are relatively small datasets.

mteb/create_dataloaders.py

Samoed · 2025-08-06T19:49:07Z

For InstructIR, the queries are already duplicated to match the instructions: https://huggingface.co/datasets/mteb/InstructIR-mteb/viewer/queries

Also, the docstring already says that queries need to be duplicated to match their instructions:

mteb/mteb/abstasks/AbsTaskRetrieval.py

Line 95 in 64478e7

    Semantically, it should contain dict[split_name, dict[sample_id, str]]. If there are multiple instructions per query, please duplicate the queries and give them unique ids for consolidation.
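
As an illustration of the duplication rule quoted from the docstring, the sketch below expands each query into one row per instruction and gives every copy a unique id. It is only a sketch under stated assumptions: the helper name `expand_queries_per_instruction`, the `{query_id}_{index}` id scheme, and the plain-dict inputs are hypothetical and are not taken from the mteb code base.

```python
from __future__ import annotations


def expand_queries_per_instruction(
    queries: dict[str, str],
    instructions_per_query: dict[str, list[str]],
) -> list[dict[str, str | None]]:
    """Duplicate each query once per instruction, with a unique id per copy.

    Hypothetical helper: `instructions_per_query` maps a query id to the
    list of instructions that apply to that query.
    """
    rows = []
    for query_id, text in queries.items():
        # Assumption: a query with no instructions is kept once, with None.
        instructions = instructions_per_query.get(query_id) or [None]
        for index, instruction in enumerate(instructions):
            # Assumption: ids are only suffixed when a query is duplicated.
            if len(instructions) == 1:
                new_id = query_id
            else:
                new_id = f"{query_id}_{index}"
            rows.append({"id": new_id, "text": text, "instruction": instruction})
    return rows


expanded_rows = expand_queries_per_instruction(
    {"q1": "some text", "q2": "other text"},
    {"q1": ["instruction a", "instruction b"]},
)
for row in expanded_rows:
    print(row)
```

If ids are rewritten this way, any relevance judgments keyed by the original query ids would presumably need the same renaming so that each duplicated row still points at its own judgments.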

orionw · 2025-08-06T20:59:32Z

I see, thanks for pointing that out. I think the FollowIR and mFollowIR tasks don't require duplication, and the only other Instruction* task I see is IFIR, which seems to have only one instruction per query. So I think we're good then?

Samoed · 2025-08-07T04:16:38Z

I think so. Can you also review #2970? I have questions there about top_ranked and how to integrate cross-encoders.

KennethEnevoldsen · 2025-08-07T09:28:36Z

Looks good to me as well

* change corpus and queries to dataset
* remove commented out code
* add convertion for v1 datasets
* fix descriptive stats
* update reranking
* format
* fix tests
* lint
* change ids of mock dataset
* change score for colbert
* add type for corpus and queries datasets
* fix reranking task
* format
* update push to hub
* update statistics calculation
* simplify `create_dataloader_for_retrieval_corpus`
* remove check with queries id
* add instruction dataset type
* fully annotate retrieval types
* remove irrelevant type annotation
* init
* base search interface implementation
* base search interface implementation
* add todo comment
* add link to todo
* Update mteb/models/search/search_crossencoder.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update mteb/create_dataloaders.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* remove search folder
* fix imports
* fix tests
* add support for cross encoder models
* combine back encoder
* add additional check for interface
* resolve copilot comment
* fix union type
* roll back rename in validate_task_to_prompt_name
* fix descriptive stats
* [v2] Combine instructions with queries (#2984)
* combine instructions with queries
* fix old format ds
* rename `MtebSupportedModelProtocols` and add `RetrievalEvaluationResult` tuple

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

combine instructions with queries

5751382

Samoed requested review from KennethEnevoldsen and orionw August 4, 2025 05:59

Samoed added the v2 Issues and PRs related to `v2` branch label Aug 4, 2025

Samoed changed the title ~~combine instructions with queries~~ [v2] Combine instructions with queries Aug 4, 2025

Samoed added 2 commits August 4, 2025 10:59

Merge branch 'integrate_search_interface' into v2/integrate_instracti…

360d3b2

…ons_to_query2

fix old format ds

5fe1371

orionw approved these changes Aug 6, 2025

Samoed linked an issue Aug 7, 2025 that may be closed by this pull request

Merge instruction with queries for retrieval #2969

Closed

Merge branch 'integrate_search_interface' into v2/integrate_instracti…

54513c9

…ons_to_query2 # Conflicts: # mteb/abstasks/AbsTaskRetrieval.py # mteb/models/search_wrappers.py

KennethEnevoldsen approved these changes Aug 7, 2025

Samoed merged commit 85078c3 into integrate_search_interface Aug 7, 2025
4 of 9 checks passed

Samoed deleted the v2/integrate_instractions_to_query2 branch August 7, 2025 10:01
