fix: Add IFIR relevant tasks #2763
Conversation
Signed-off-by: SighingSnow <songtingyu220@gmail.com>
Congratulations on the paper and thanks for the PR! There seem to be a few missing things, but nothing major. Running a couple of models on the benchmark would be great.
I examined a bit of the data to figure out how you handled the instruction without using the instruction retrieval task, but found a few odd cases:
How Beans Help Our Bones I am a student researching the impact [...]
I suspect it is:
Query: How Beans Help Our Bones
Instruction: I am a student researching the impact [...]
The casing also seems like a (consistent) formatting issue.
@orionw since it is instruction following I just want to get your eyes on this as well.
eval_splits=["test"], | ||
eval_langs=["eng-Latn"], | ||
main_score="ndcg_at_20", | ||
date=None, |
There are still a bunch of annotations missing. These will need to be filled out.
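For reference, here is a hedged sketch of how the missing annotations might be filled in, continuing the TaskMetadata fields shown in the diff above. The field names follow mteb's TaskMetadata, but every concrete value below is a placeholder that would need to be checked against the IFIR paper and dataset card.

    date=("2023-01-01", "2025-03-01"),  # placeholder: the data collection period
    domains=["Medical", "Academic", "Written"],  # placeholder domain annotation
    task_subtypes=[],
    license="not specified",  # placeholder until the actual license is confirmed
    annotations_creators="derived",  # placeholder guess
    dialect=[],
    sample_creation="found",  # placeholder guess
    bibtex_citation=r"""@misc{...}""",  # the IFIR citation (arXiv:2503.04644) would go here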
"path": "if-ir/cds", | ||
"revision": "c406767", | ||
}, | ||
description="BBenchmark IFIR cds subset within instruction following abilities.", |
The description has to be detailed enough for me to get a gist of what the dataset contains.
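Purely as an illustration of the level of detail being asked for, a description along these lines might work; the wording below is a guess and is not taken from the PR or the paper.

    description=(
        "IFIR cds subset: an instruction-following retrieval task in which each "
        "query is paired with a natural-language instruction that constrains "
        "which documents count as relevant; evaluated with nDCG@20."
    ),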
@@ -274,6 +274,33 @@
    """,
)

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(
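The hunk above only shows the opening line of the new Benchmark entry. Below is a hedged sketch of how such an entry is typically completed in mteb's benchmarks file; the benchmark name, task list, and description are placeholders rather than the values from this PR.

from mteb import get_tasks
from mteb.benchmarks import Benchmark

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(
    name="MTEB(IFIR)",  # placeholder benchmark name
    tasks=get_tasks(tasks=["IFIRCds"]),  # placeholder: the IFIR task names added in this PR
    description="Retrieval tasks whose queries come with domain-specific instructions.",
    reference="https://arxiv.org/abs/2503.04644",
    citation=None,  # the IFIR bibtex entry would go here
)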
If we create the benchmark, you will have to submit some models that have been evaluated on it as well.
Yes, sure. I will further run some models on it.
eval_langs=["eng-Latn"], | ||
main_score="ndcg_at_20", | ||
date=None, | ||
# domains=["Medical", "Academic", "Written"], |
Please do annotate the domains.
Yes, no problem.
Awesome to see this, thanks @SighingSnow!
My understanding of IFIR is that there is one instruction per query but that instruction was done at three hierarchies, is that right @SighingSnow? I assume the most expansive/longest query is the one you want other people to evaluate on, but please correct me if I'm wrong.
If there is only one instruction per query, it could be formatted as Retrieval (they share the same underlying class AbstractRetrieval), but I think it would make more sense as InstructionRetrieval, of course.
I think there are just a few minor things you need for that: create a separate split of the dataset mapping query_id to instruction (see here for more details on what that looks like), change the class to InstructionRetrieval, and move these into the InstructionRetrieval folder instead of the Retrieval folder.
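A rough sketch of that separate instruction split, assuming a simple two-column layout; the column names, ids, and repo id here are guesses rather than the exact format mteb expects.

from datasets import Dataset

# one row per query: its id and the instruction attached to it
instructions = Dataset.from_dict(
    {
        "query-id": ["q1", "q2"],
        "instruction": [
            "I am a student researching the impact [...]",
            "[...]",
        ],
    }
)
# the split would then live next to the queries/corpus/qrels, e.g.:
# instructions.push_to_hub("if-ir/cds", config_name="instruction", split="test")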
    },
    description="Benchmark IFIR cds subset within instruction following abilities.",
    reference="https://arxiv.org/abs/2503.04644",
    type="Retrieval",
This should be changed to InstructionRetrieval
I saw there is a comment that AbsTaskInstructionRetrieval will be merged with Retrieval in v2.0.0. So do I still need to move it to the InstructionRetrieval folder?
Ah I see - I thought this was already going into v2. We have all the updates to instruction retrieval there, which could be causing some confusion.
@KennethEnevoldsen does it make sense to put this in v2? We are switching soon anyways IIUC.
to create a separate split of the dataset mapping query_id to instruction
Should it be similar to InstructIR in InstructIR-mteb?
Yes, same format (two columns), although it will differ according to your setup. They have 10 instructions for each query, which they aggregate into one score per query, so you could aggregate yours however you'd prefer.
Really, what's important is that the number you get at the end is the score you want people to use for your benchmark.
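A minimal sketch of that aggregation, with made-up scores: several per-instruction scores for the same query are collapsed into one score per query, and those are then averaged into the single number reported for the task.

from collections import defaultdict
from statistics import mean

# (query_id, instruction_id) -> score; values are purely illustrative
per_instruction_scores = {
    ("q1", "instr_1"): 0.61,
    ("q1", "instr_2"): 0.58,
    ("q2", "instr_1"): 0.74,
}

by_query = defaultdict(list)
for (query_id, _instruction_id), score in per_instruction_scores.items():
    by_query[query_id].append(score)

per_query = {qid: mean(scores) for qid, scores in by_query.items()}
final_score = mean(per_query.values())  # the one number reported for the benchmark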
Yes, but we only provide three hierarchies in four subsets (fiqa, aila, scifact_open and pm).
Yes, I will check it in more detail. Should I make all these 7 datasets InstructionRetrieval?
Yeah, so I think it would be best to move this to v2. I plan to have it merged during the summer holidays. @Samoed, would there be any issues with transferring this that I don't know of?
I don't think there are any issues. The dataset is in the correct format and should be easy to integrate.
Would moving it over work for you @SighingSnow?
Yes, no problem. I will try to get some results and fix my dataset format as @orionw mentioned.
Hey @orionw, should I follow InstructIR in the v2.0.0 branch to format my data?
Just to confirm: you want separate evaluation numbers for each instruction setting? E.g. level 1 has a score of X, level 2 has a score of Y, etc. If this is the case, I think the easiest way to do it is:
A) Create three datasets on HF, one for each of the three versions of the instructions (with everything the same except the instructions), and put three versions of the task in the task file that link to those three datasets (like this one that has two versions). The downside is you have to do a bit of duplication on HF, since the corpora/qrels/queries are the same and only the instructions differ.
You could also do something more complicated, but useful if you have mixed metrics:
B) If you want mixed numbers (like the microaverage of two versions of the instructions), you can do it by duplicating the queries with some identifier (i.e. [...])
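The tail of that suggestion is truncated above; the sketch below is one reading of option B, with made-up query ids, level names, and texts: each query is duplicated once per instruction level, and the level is encoded in the query id so the qrels can be duplicated the same way.

queries = {"q1": "How Beans Help Our Bones"}
instructions_by_level = {
    "q1": {
        "level1": "shortest instruction [...]",
        "level2": "more detailed instruction [...]",
        "level3": "most detailed instruction [...]",
    }
}

expanded_queries, expanded_instructions = {}, {}
for qid, text in queries.items():
    for level, instruction in instructions_by_level[qid].items():
        new_id = f"{qid}_{level}"  # the identifier encodes the instruction level
        expanded_queries[new_id] = text
        expanded_instructions[new_id] = instruction
# the qrels for q1 would likewise be copied to q1_level1, q1_level2 and q1_level3,
# after which scores can be reported per level or micro-averaged across levels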
Got it. I will take the second choice. I hope I can finish this job this weekend. Thank you for your timely reply and help!
@KennethEnevoldsen Sorry for bothering you, but should I rebase my code onto the v2.0.0 branch?
@SighingSnow I changed the base to v2.0.0; you should be able to simply [...]
Thanks for all your patience! I have created a new PR on the v2.0.0 branch. Looking forward to receiving your feedback.
I have outlined why this dataset is filling an existing gap in mteb
I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (a hedged usage sketch follows this checklist).
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
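A hedged sketch of running a model on one of the new tasks via mteb's Python API, roughly equivalent to the CLI command above; the task name used here is a placeholder for whatever the IFIR tasks end up being called.

import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["IFIRCds"])  # placeholder task name
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")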