feat: Add new benchmark BEIR-NL #1909
Conversation
You also need to add your benchmark to the benchmark file.
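For reference, the registration in the benchmark file could look roughly like the sketch below (assuming the usual Benchmark fields such as description, reference, and citation; the task list is abbreviated and everything except HotpotQA-NL is a placeholder for the actual BEIR-NL tasks):

BEIR_NL = Benchmark(
    name="BEIR-NL",
    tasks=get_tasks(
        tasks=[
            "HotpotQA-NL",
            # ... the remaining BEIR-NL tasks
        ],
    ),
    description="BEIR-NL: a Dutch translation of the BEIR retrieval benchmark.",
    reference=None,  # placeholder; point this at the BEIR-NL paper or dataset page
    citation=None,
)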
Great to have BEIR-NL added!
I have noted a few additional pointers - @Samoed shouldn't we have a failing test here, since the descriptive statistics are missing?
domains=["Written"], | ||
task_subtypes=[], | ||
license="cc-by-sa-4.0", | ||
annotations_creators="LM-generated and reviewed", |
Isn't this "derived" from the English data?
Indeed, the "derived" category would fit better. I will change that.
Done.
license="cc-by-sa-4.0", | ||
annotations_creators="LM-generated and reviewed", | ||
dialect=[], | ||
sample_creation="machine-translated and verified", |
How were these verified?
We manually checked a small subset of translations. If it does not fit into the "verified" category, I can remove that.
No, that is perfectly fine - will you add a comment like so:
sample_creation="machine-translated and verified",  # manually checked a small subset
Done.
Added a few additional pointers on metadata.
Since nothing has been run on these tasks, the leaderboard will appear empty. You might consider submitting scores at least for a relevant set of models. If you don't have the resources for this, I will have to figure out how we handle an empty benchmark (it might cause a bug or lead to confusion on the leaderboard).
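If it helps, a minimal way to produce such scores with the Python API might look like this (the model and output folder are just examples, and mteb.get_benchmark("BEIR-NL") is assumed to work once the benchmark is registered):

import mteb

# Example model from the PR checklist; any relevant model works the same way.
model = mteb.get_model("intfloat/multilingual-e5-small")
tasks = mteb.get_tasks(tasks=["HotpotQA-NL"])  # or mteb.get_benchmark("BEIR-NL")
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")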
BEIR_NL = Benchmark(
    name="BEIR-NL",
    tasks=get_tasks(
        tasks=[
Great - I can see that many models are trained on the English version of these datasets, and since they have not been annotated as trained on these datasets, they will appear as zero-shot on BEIR-NL (despite being trained on e.g. FEVER). To avoid this you would need to update the model annotations (searching for "NQ", "FEVER", etc. should allow you to find the relevant cases and update the annotations).
I submitted some results to embeddings-benchmark/results#105 and updated the model annotations.
merged!
eval_splits=["test"], | ||
eval_langs=["nld-Latn"], | ||
main_score="ndcg_at_10", | ||
date=("2024-10-01", "2024-10-01"), |
On a second round through the annotations, I see that the dates do not quite match the time of the original data (is it the time of translation?).
I checked the previous translated dataset, and there we annotated the date range of the source data.
Feel free to give your best guess here, but simply annotate it with # best guess.
Indeed, these dates were the time of translation.
eval_langs=["nld-Latn"], | ||
main_score="ndcg_at_10", | ||
date=("2024-10-01", "2024-10-01"), | ||
domains=["Written"], |
The domains seem only minimally filled out; at least "Non-fiction" would apply to many of these.
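For illustration, the two annotations discussed above might end up looking roughly like this (placeholder dates; the actual range should reflect the English source data, or a best guess):

date=("YYYY-MM-DD", "YYYY-MM-DD"),  # best guess: date range of the original English source data
domains=["Written", "Non-fiction"],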
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Great work! It looks like all comments have been addressed; we just need to resolve the merge conflicts. @nikolay-banar it would be great to have this merged into the repo soon!
"NQHardNegatives": ["test"], | ||
"HotPotQA": ["test"], | ||
"HotPotQAHardNegatives": ["test"], | ||
"HotPotQA-PL": ["test"], # translated from hotpotQA (not trained on) | ||
"HotpotQA-NL": ["test"], # translated from hotpotQA (not trained on) |
@KennethEnevoldsen if these are "not trained on", should we still keep them? Personally, I find these very confusing.
I think translating a dataset and training on it should still lead to a non-zero-shot entry on the benchmark - these annotations are just there to capture that. We could "link" the tasks and update the leaderboard code (but currently that is not how it is done).
Got it, makes sense, thanks. The "not trained on" part was confusing to me. Maybe it could say something closer to "trained on a translation" in the future?
@isaac-chung Some tests are failing, but that doesn't seem to be related to my code.
@nikolay-banar thanks! I'm rerunning them now.
@isaac-chung @KennethEnevoldsen @Samoed Thank you for your reviews!
BEIR_NL = Benchmark(
    name="BEIR-NL",
    tasks=get_tasks(
        tasks=[
merged!
Everything is good on my end, so I will merge this in.
@KennethEnevoldsen I have noticed a small bug in SCIDOCSNLRetrieval.py with eval_langs (it should be ["nld-Latn"]). Should I open a new issue for that?
You can create a PR with the fix.
This is awesome work! In the "Language-specific" section of the sidebar on the leaderboard, there isn't currently a filter for Dutch. Could one be added?
We recently published BEIR-NL, which is a Dutch translation of BEIR.
Adding datasets checklist
- Reason for dataset addition: BEIR-NL, a new benchmark for retrieval in Dutch.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- I have considered the size of the dataset and reduced it if needed, using self.stratified_subsampling() under dataset_transform().
- I have run tests locally to make sure nothing is broken, using make test.
- I have run the formatter to format the code, using make lint.