
Conversation

SighingSnow

  • I have outlined why this dataset is filling an existing gap in mteb

  • I have tested that the dataset runs with the mteb package.

  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (a minimal Python equivalent is sketched after this list).

    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
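
For reference, a minimal sketch of how the two checklist models could be run through the mteb Python API (mteb 1.x style); the task name "IFIRCds" is only a placeholder for whatever task names this PR registers:

```python
# Sketch: run the two checklist models on a new task via the mteb Python API.
# "IFIRCds" is a placeholder task name, not necessarily the name added in this PR.
import mteb
from sentence_transformers import SentenceTransformer

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    tasks = mteb.get_tasks(tasks=["IFIRCds"])  # placeholder task name
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder=f"results/{model_name}")
```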

Signed-off-by: SighingSnow <songtingyu220@gmail.com>
@KennethEnevoldsen changed the title from "Add IFIR relevant tasks." to "fix: Add IFIR relevant tasks" on Jun 3, 2025
@KennethEnevoldsen added the "new dataset" label (Issues related to adding a new task or dataset) on Jun 3, 2025

@KennethEnevoldsen (Contributor) left a comment

Congratulations on the paper and thanks for the PR! There seem to be a few missing things, but nothing major. Running a couple of models on the benchmark would be great.

I examined a bit of the data to figure out how you handled the instruction without using the instruction retrieval task, but found a few odd cases:

> How Beans Help Our Bones I am a student researching the impact [...]

I suspect it is:

Query: How Beans Help Our Bones
Instruction: I am a student researching the impact [...]

The casing also seems like a (consistent) formatting issue.

@orionw, since it is instruction following, I just want to get your eyes on this as well.

eval_splits=["test"],
eval_langs=["eng-Latn"],
main_score="ndcg_at_20",
date=None,

Contributor

There are still a bunch of annotations missing. These will need to be filled out.
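
As a rough sketch of the kind of annotations being asked for, here are the keyword arguments one might fill in on the TaskMetadata; every concrete value below (dates, license, creators, etc.) is an illustrative assumption, not taken from the IFIR paper:

```python
# Illustrative only: values to be replaced by the dataset authors.
missing_annotations = dict(
    date=("2020-01-01", "2025-03-01"),   # data collection period (placeholder)
    domains=["Medical", "Academic", "Written"],
    task_subtypes=["Article retrieval"],
    license="cc-by-4.0",                 # placeholder; check the dataset card
    annotations_creators="derived",
    dialect=[],
    sample_creation="found",
    bibtex_citation="""@article{...}""",  # copy from the IFIR paper
)
```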

"path": "if-ir/cds",
"revision": "c406767",
},
description="BBenchmark IFIR cds subset within instruction following abilities.",

Contributor

The description has to be detailed enough for me to get a gist of what the dataset contains.

@@ -274,6 +274,33 @@
""",
)

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(

Contributor

If we create the benchmark, you will have to submit some models that have been evaluated on it as well.
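
If the benchmark is kept, a minimal sketch of what the definition might look like, following how other mteb benchmarks are declared; the benchmark name and task names are placeholders for the IFIR subsets in this PR, and the exact import path may differ between mteb versions:

```python
# Sketch of a Benchmark definition; names below are placeholders.
import mteb
from mteb.benchmarks import Benchmark  # import path may differ by version

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(
    name="MTEB(IFIR)",  # placeholder name
    tasks=mteb.get_tasks(tasks=["IFIRCds", "IFIRFiQA"]),  # placeholder task names
    description="Instruction-following retrieval over domain-specific corpora (IFIR).",
    reference="https://arxiv.org/abs/2503.04644",
    citation=None,
)
```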

Author

Yes, sure. I will run some more models on it.

eval_langs=["eng-Latn"],
main_score="ndcg_at_20",
date=None,
# domains=["Medical", "Academic", "Written"],

Contributor

Please do annotate the domains.

Author

Yes, no problem.

@orionw (Contributor) left a comment

Awesome to see this, thanks @SighingSnow!

> I examined a bit of the data to figure out how you handled the instruction without using the instruction retrieval task, but found a few odd cases:

My understanding of IFIR is that there is one instruction per query but that instruction was done at three hierarchies, is that right @SighingSnow? I assume the most expansive/longest query is the one you want other people to evaluate on, but please correct me if I'm wrong.

If there is only one instruction per query, it could be formatted as Retrieval (they share the same underlying class AbstractRetrieval) but I think it would make more sense as InstructionRetrieval of course.

I think there are just a few minor things you need for that: to create a separate split of the dataset mapping query_id to instruction (see here for more details on what that looks like), change the class to InstructionRetrieval, and move these into the InstructionRetrieval folder instead of the Retrieval folder.
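
A rough sketch of what such a query_id → instruction split pushed to the Hub could look like; this is not the official mteb format, and the column names and config name below are assumptions:

```python
# Sketch: build and upload a split mapping each query id to its instruction.
# Column names ("query-id", "instruction") and config_name are assumptions.
from datasets import Dataset

instructions = Dataset.from_dict(
    {
        "query-id": ["q1", "q2"],
        "instruction": [
            "I am a student researching the impact of beans on bone health ...",
            "As a clinician, I am looking for trials on ...",
        ],
    }
)
instructions.push_to_hub("if-ir/cds", config_name="instruction", split="test")
```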

},
description="BBenchmark IFIR cds subset within instruction following abilities.",
reference="https://arxiv.org/abs/2503.04644",
type="Retrieval",

Contributor

This should be changed to InstructionRetrieval

Author

I saw there is a comment that AbsTaskInstructionRetrieval will be merged with Retrieval in v2.0.0. So do I still need to move it to the InstructionRetrieval folder?

Contributor

Ah I see - I thought this was already going into v2. We have all the updates to instruction retrieval there, which could be causing some confusion.

@KennethEnevoldsen does it make sense to put this in v2? We are switching soon anyways IIUC.

Author

> to create a separate split of the dataset mapping query_id to instruction

Should it be similar to InstructIR in InstructIR-mteb?

Contributor

Yes, same format (two columns), although it will differ according to your setup. They have 10 instructions for each query, which they aggregate into one score per query. So you could aggregate yours however you'd prefer.

Really, what's important is that the number you get at the end is the score you want people to use for your benchmark.

@SighingSnow (Author) commented Jun 4, 2025

> My understanding of IFIR is that there is one instruction per query but that instruction was done at three hierarchies, is that right

Yes, but we only provide three hierarchies in four subsets (fiqa, aila, scifact_open, and pm).

> but I think it would make more sense as InstructionRetrieval of course.

Yes, I will check it in more detail. Should I make all these 7 datasets InstructionRetrieval?

@KennethEnevoldsen (Contributor)

Yeah, so I think it would be best to move this to v2. I plan to have it merged during the summer holidays. @Samoed, would there be any issues in transferring this that I don't know of?

@Samoed (Member) commented Jun 8, 2025

I don't think there are any issues. The dataset is in the correct format and should be easy to integrate.

@KennethEnevoldsen (Contributor)

Would moving it over work for you @SighingSnow?

@SighingSnow (Author)

> Would moving it over work for you

Yes, no problem. I will try to get some results and fix my dataset format as @orionw mentioned.

@SighingSnow (Author) commented Jun 10, 2025

Hey @orionw, should I follow the InstructIR format in the v2.0.0 branch to format my data?
For 3 sub-tasks, I have only one level. For the remaining 4 sub-tasks, I have 3 levels of instructions. So may I include all the instructions in this dataset?

@orionw (Contributor) commented Jun 10, 2025

> For 3 sub-tasks, I have only one level. For the remaining 4 sub-tasks, I have 3 levels of instructions. So may I include all the instructions in this dataset?

Just to confirm: you want separate evaluation numbers for each instruction setting? E.g. level 1 has a score of X, level 2 has a score of Y, etc. If this is the case, I think the easiest way to do it is:

A) Create three datasets on HF when there are three versions of the instructions (everything the same except the instructions) and put three versions of the task in the task file that link to those three datasets (like this one that has two versions). The downside is that you have to do a bit of duplication on HF, since the corpora/qrels/queries are the same and only the instructions differ.

You could also do the following (more complicated, but useful if you want mixed metrics):

B) If you want mixed numbers (like the micro-average of two versions of the instructions), you can do it by duplicating the queries with some identifier (i.e. query_23_instruction_v1 and query_23_instruction_v2) and then doing some aggregation at the end with the task_specific_scores function (e.g. if "_v1" is in the query, calculate the v1 score; or if "_v1" or "_v2" is in the queries and "_v3" is not, etc.). This is a bit more complicated, but it lets you do arbitrary things with the scores.
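
For what it's worth, a rough sketch of the suffix-based aggregation option B describes; the helper below is illustrative only and is not the exact task_specific_scores hook in mteb:

```python
# Sketch: aggregate per-query scores into one number per instruction level,
# assuming query ids carry a suffix such as "query_23_instruction_v1".
def aggregate_by_instruction_level(per_query_scores: dict[str, float]) -> dict[str, float]:
    by_level: dict[str, list[float]] = {}
    for query_id, score in per_query_scores.items():
        level = query_id.rsplit("_", 1)[-1]  # "v1", "v2", "v3"
        by_level.setdefault(level, []).append(score)
    # one mean score per instruction level, e.g. {"ndcg_at_20_v1": 0.42, ...}
    return {f"ndcg_at_20_{level}": sum(s) / len(s) for level, s in by_level.items()}
```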

@SighingSnow (Author)

Got it. I will take the second option. I hope I can finish this over the weekend. Thank you for your timely reply and help!

@SighingSnow (Author) commented Jun 11, 2025

@KennethEnevoldsen Sorry for bothering you, but should I rebase my code on the v2.0.0 branch?
I mean, if I use the main branch code, my class will be derived from the AbsTaskInstructionRetrieval class, which is designed specifically for FollowIR.

@KennethEnevoldsen changed the base branch from main to v2.0.0 on June 11, 2025, 18:41

@KennethEnevoldsen (Contributor)

@SighingSnow I changed the base to v2.0.0; you should be able to simply `git merge origin v2.0.0`.

@SighingSnow mentioned this pull request on Jun 12, 2025

@SighingSnow (Author)

Thanks for all your patience! I have created a new PR on the v2.0.0 branch. Looking forward to receiving your feedback, and I am glad to address your comments then.
