fix: Add IFIR relevant tasks #2763
Conversation
Signed-off-by: SighingSnow <songtingyu220@gmail.com>
Congratulations on the paper and thanks for the PR! There seem to be a few missing things, but nothing major. Running a couple of models on the benchmark would be great.
I examined a bit of the data to figure out how you handled the instruction without using the instruction retrieval task, but found a few odd cases:
How Beans Help Our Bones I am a student researching the impact [...]
I suspect it is:
Query: How Beans Help Our Bones
Instruction: I am a student researching the impact [...]
The casing also seems like a (consistent) formatting issue.
@orionw since it is instruction following I just want to get your eyes on this as well.
eval_splits=["test"], | ||
eval_langs=["eng-Latn"], | ||
main_score="ndcg_at_20", | ||
date=None, |
There are still a bunch of annotations missing. These will need to be filled out.
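For reference, here is a hedged sketch of how the missing annotations might be filled in, continuing the TaskMetadata fields shown in the diff above. The field names follow mteb's TaskMetadata, but every concrete value below is a placeholder that would need to be checked against the IFIR paper and dataset card.

    date=("2023-01-01", "2025-03-01"),  # placeholder: the data collection period
    domains=["Medical", "Academic", "Written"],  # placeholder domain annotation
    task_subtypes=[],
    license="not specified",  # placeholder until the actual license is confirmed
    annotations_creators="derived",  # placeholder guess
    dialect=[],
    sample_creation="found",  # placeholder guess
    bibtex_citation=r"""@misc{...}""",  # the IFIR citation (arXiv:2503.04644) would go here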
"path": "if-ir/cds", | ||
"revision": "c406767", | ||
}, | ||
description="BBenchmark IFIR cds subset within instruction following abilities.", |
The description has to be detailed enough for me to get a gist of what the dataset contains.
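Purely as an illustration of the level of detail being asked for, a description along these lines might work; the wording below is a guess and is not taken from the PR or the paper.

    description=(
        "IFIR cds subset: an instruction-following retrieval task in which each "
        "query is paired with a natural-language instruction that constrains "
        "which documents count as relevant; evaluated with nDCG@20."
    ),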
@@ -274,6 +274,33 @@
    """,
)

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(
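The hunk above only shows the opening line of the new Benchmark entry. Below is a hedged sketch of how such an entry is typically completed in mteb's benchmarks file; the benchmark name, task list, and description are placeholders rather than the values from this PR.

from mteb import get_tasks
from mteb.benchmarks import Benchmark

MTEB_RETRIEVAL_WITH_DOMAIN_INSTRUCTIONS = Benchmark(
    name="MTEB(IFIR)",  # placeholder benchmark name
    tasks=get_tasks(tasks=["IFIRCds"]),  # placeholder: the IFIR task names added in this PR
    description="Retrieval tasks whose queries come with domain-specific instructions.",
    reference="https://arxiv.org/abs/2503.04644",
    citation=None,  # the IFIR bibtex entry would go here
)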
If we create the benchmark, you will have to submit some models that have been evaluated on it as well.
Yes, sure. I will further run some models on it.
eval_langs=["eng-Latn"], | ||
main_score="ndcg_at_20", | ||
date=None, | ||
# domains=["Medical", "Academic", "Written"], |
Please do annotate the domains.
Yes, no problem.
Awesome to see this, thanks @SighingSnow!
My understanding of IFIR is that there is one instruction per query but that instruction was done at three hierarchies, is that right @SighingSnow? I assume the most expansive/longest query is the one you want other people to evaluate on, but please correct me if I'm wrong.
If there is only one instruction per query, it could be formatted as Retrieval (they share the same underlying class AbstractRetrieval), but I think it would make more sense as InstructionRetrieval, of course.
I think there are just a few minor things you need for that: create a separate split of the dataset mapping query_id to instruction (see here for more details on what that looks like), change the class to InstructionRetrieval, and move these into the InstructionRetrieval folder instead of the Retrieval folder.
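A rough sketch of that separate instruction split, assuming a simple two-column layout; the column names, ids, and repo id here are guesses rather than the exact format mteb expects.

from datasets import Dataset

# one row per query: its id and the instruction attached to it
instructions = Dataset.from_dict(
    {
        "query-id": ["q1", "q2"],
        "instruction": [
            "I am a student researching the impact [...]",
            "[...]",
        ],
    }
)
# the split would then live next to the queries/corpus/qrels, e.g.:
# instructions.push_to_hub("if-ir/cds", config_name="instruction", split="test")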
    },
    description="Benchmark IFIR cds subset within instruction following abilities.",
    reference="https://arxiv.org/abs/2503.04644",
    type="Retrieval",
This should be changed to InstructionRetrieval
I saw there is a comment that AbsTaskInstructionRetrieval will be merged with Retrieval in v2.0.0. So do I still need to move it to the InstructionRetrieval folder?
Ah I see - I thought this was already going into v2. We have all the updates to instruction retrieval there, which could be causing some confusion.
@KennethEnevoldsen does it make sense to put this in v2? We are switching soon anyways IIUC.
to create a separate split of the dataset mapping query_id to instruction
Should it be similar to InstructIR in InstructIR-mteb?
Yes, same format (two columns), although it will differ according to your setup. They have 10 instructions for each query, which they aggregate into one score per query, so you could aggregate yours however you'd prefer.
Really, what's important is that the number you get at the end is the score you want people to use for your benchmark.
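A minimal sketch of that aggregation, with made-up scores: several per-instruction scores for the same query are collapsed into one score per query, and those are then averaged into the single number reported for the task.

from collections import defaultdict
from statistics import mean

# (query_id, instruction_id) -> score; values are purely illustrative
per_instruction_scores = {
    ("q1", "instr_1"): 0.61,
    ("q1", "instr_2"): 0.58,
    ("q2", "instr_1"): 0.74,
}

by_query = defaultdict(list)
for (query_id, _instruction_id), score in per_instruction_scores.items():
    by_query[query_id].append(score)

per_query = {qid: mean(scores) for qid, scores in by_query.items()}
final_score = mean(per_query.values())  # the one number reported for the benchmark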
Yes, but we only provide three hierarchies in four subsets (fiqa, aila, scifact_open and pm).
Yes, I will check it in more detail. Should I make all these 7 datasets InstructionRetrieval?
Yeah, so I think it would be best to move this to v2. I plan to have it merged during the summer holidays. @Samoed, would there be any issues with transferring this that I don't know of?
I don't think there are any issues. The dataset is in the correct format and should be easy to integrate.
Would moving it over work for you @SighingSnow?
Yes, no problem. I will try to get some results and fix my dataset format as @orionw mentioned.
Hey @orionw, should I follow InstructIR in the v2.0.0 branch to format my data?
Just to confirm: you want separate evaluation numbers for each instruction setting? E.g. level 1 has a score of X, level 2 has a score of Y, etc. If this is the case, I think the easiest way to do it is:
A) Create three datasets on HF, one for each of the three versions of the instructions (with everything the same except the instructions), and put three versions of the task in the task file that link to those three datasets (like this one that has two versions). The downside is you have to do a bit of duplication on HF, since the corpora/qrels/queries are the same and only the instructions differ.
You could also do something more complicated, but useful if you have mixed metrics:
B) If you want mixed numbers (like the microaverage of two versions of the instructions), you can do it by duplicating the queries with some identifier (i.e. [...])
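The tail of that suggestion is truncated above; the sketch below is one reading of option B, with made-up query ids, level names, and texts: each query is duplicated once per instruction level, and the level is encoded in the query id so the qrels can be duplicated the same way.

queries = {"q1": "How Beans Help Our Bones"}
instructions_by_level = {
    "q1": {
        "level1": "shortest instruction [...]",
        "level2": "more detailed instruction [...]",
        "level3": "most detailed instruction [...]",
    }
}

expanded_queries, expanded_instructions = {}, {}
for qid, text in queries.items():
    for level, instruction in instructions_by_level[qid].items():
        new_id = f"{qid}_{level}"  # the identifier encodes the instruction level
        expanded_queries[new_id] = text
        expanded_instructions[new_id] = instruction
# the qrels for q1 would likewise be copied to q1_level1, q1_level2 and q1_level3,
# after which scores can be reported per level or micro-averaged across levels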
Got it. I will take the second choice. I hope I can finish this job this weekend. Thank you for your timely reply and help!
@KennethEnevoldsen Sorry for bothering you, but should I rebase my code onto the v2.0.0 branch?
@SighingSnow I changed the base to v2.0.0; you should be able to simply [...]
Thanks for all your patience! I have created a new PR on the v2.0.0 branch. Looking forward to receiving your feedback.
I have outlined why this dataset is filling an existing gap in mteb
I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (a hedged usage sketch follows this checklist).
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
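A hedged sketch of running a model on one of the new tasks via mteb's Python API, roughly equivalent to the CLI command above; the task name used here is a placeholder for whatever the IFIR tasks end up being called.

import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["IFIRCds"])  # placeholder task name
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")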