Conversation

nikolay-banar (Contributor)

We recently published BEIR-NL, a Dutch-translated version of BEIR.

Adding datasets checklist

Reason for dataset addition: BEIR-NL, a new benchmark for retrieval in Dutch.

  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb -m {model_name} -t {task_name}` command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()` (see the sketch after this list).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using `make test`.
  • Run the formatter to format the code using `make lint`.
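For context on the subsampling item above, here is a minimal sketch of what that override can look like. It assumes a classification-style task (retrieval tasks generally do not need it), and the exact `stratified_subsampling` signature may differ between mteb versions:

```python
from mteb.abstasks import AbsTaskClassification  # import path may vary by version


class ExampleNLClassification(AbsTaskClassification):
    # (metadata omitted for brevity)

    def dataset_transform(self):
        # Shrink an over-sized split while preserving the label distribution.
        # Argument names here are assumptions; check AbsTask in your mteb version.
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["test"]
        )
```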

@nikolay-banar changed the title BEIR-NL → Add new benchmark BEIR-NL, Jan 30, 2025
@Samoed changed the title Add new benchmark BEIR-NL → feat: Add new benchmark BEIR-NL, Jan 30, 2025

@Samoed (Member) left a comment

You also need to add your benchmark to the benchmark file.
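For reference, a registration entry in that file roughly follows the pattern below. This is only a sketch based on the `Benchmark` snippet quoted later in this thread; the task list, description, and citation are placeholders:

```python
from mteb import get_tasks
from mteb.benchmarks import Benchmark  # import path may vary between mteb versions

BEIR_NL = Benchmark(
    name="BEIR-NL",
    tasks=get_tasks(
        tasks=["HotpotQA-NL", "SCIDOCS-NL"],  # illustrative subset, not the full task list
    ),
    description="BEIR translated to Dutch.",  # placeholder text
    citation="...",  # placeholder
)
```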

@KennethEnevoldsen (Contributor) left a comment

Great to have BEIR-NL added!

I have noted a few additional pointers. @Samoed, shouldn't we have a test failing here since the descriptive statistics are missing?

domains=["Written"],
task_subtypes=[],
license="cc-by-sa-4.0",
annotations_creators="LM-generated and reviewed",

Contributor:

Isn't this "derived" from the English data?

Contributor (Author):

Indeed, the "derived" category would fit better. I will change that.

Contributor (Author):

Done.

license="cc-by-sa-4.0",
annotations_creators="LM-generated and reviewed",
dialect=[],
sample_creation="machine-translated and verified",

Contributor:

How were these verified?

@nikolay-banar (Contributor, Author), Jan 30, 2025:

We manually checked a small subset of translations. If it does not fit into the "verified" category, I can remove that.

Contributor:

No, that is perfectly fine. Will you add a comment like so:

Suggested change
sample_creation="machine-translated and verified",
sample_creation="machine-translated and verified", # manually checked a small subset

Contributor (Author):

Done.

@KennethEnevoldsen (Contributor) left a comment

Added a few additional pointers on metadata

Since nothing has been run on these tasks, the leaderboard will appear empty. You might consider submitting scores for at least a relevant set of models. If you don't have the resources for this, I will have to figure out how we handle an empty benchmark (it might cause a bug or lead to confusion on the leaderboard).
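As a rough sketch, generating scores for submission could look like the following (the model choice, task subset, and output path are illustrative; the CLI equivalent is the `mteb -m ... -t ...` command from the checklist above):

```python
import mteb

# Evaluate one multilingual model on a couple of BEIR-NL tasks and write the
# results to a folder that can later be submitted to the results repository.
model = mteb.get_model("intfloat/multilingual-e5-small")
tasks = mteb.get_tasks(tasks=["HotpotQA-NL", "SCIDOCS-NL"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/multilingual-e5-small")
```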

BEIR_NL = Benchmark(
name="BEIR-NL",
tasks=get_tasks(
tasks=[

Contributor:

Great. I can see that many models are trained on the English versions of these datasets, and since they have not been annotated as trained on these datasets, they will appear as zero-shot on BEIR-NL (despite being trained on e.g. FEVER). To avoid this, you would need to update the model annotations (searching for "NQ", "FEVER", etc. should allow you to find the relevant cases and update the annotations).
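A sketch of what such an annotation update might look like, following the per-dataset dict format quoted later in this review (the variable name and entries are illustrative, not the actual diff):

```python
# Datasets the model was trained on, listed per split. The translated variants
# are added next to the originals so the model is not shown as zero-shot on BEIR-NL.
training_datasets = {
    "NQ": ["test"],
    "NQ-NL": ["test"],  # translated from NQ (not trained on)
    "FEVER": ["test"],
    "FEVER-NL": ["test"],  # translated from FEVER (not trained on)
}
```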

Contributor (Author):

I submitted some results to embeddings-benchmark/results#105 and updated the model annotations.

Contributor:

merged!

eval_splits=["test"],
eval_langs=["nld-Latn"],
main_score="ndcg_at_10",
date=("2024-10-01", "2024-10-01"),

Contributor:

On a second round through the annotations I see that the dates do not quite match the time of the original data (Is it the time of translation?)

I checked the previous translated dataset and there we annotated the data range of the source data.

Contributor:

Feel free to give your best guess here, but simply annotate it with `# best guess`.

Contributor (Author):

Indeed, these dates were the time of translation.

eval_langs=["nld-Latn"],
main_score="ndcg_at_10",
date=("2024-10-01", "2024-10-01"),
domains=["Written"],

Contributor:

The domains seem only minimally filled out; at least "Non-fiction" would apply to many of these.
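That is, something along the lines of the following (suggested values only, to be confirmed per task):

```python
domains=["Written", "Non-fiction"],
```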

nikolay-banar and others added 6 commits January 30, 2025 21:53

@isaac-chung (Collaborator) left a comment

Great work! It looks like all comments have been addressed; we just need to resolve the merge conflicts. @nikolay-banar, it would be great to have this merged into the repo soon!

"NQHardNegatives": ["test"],
"HotPotQA": ["test"],
"HotPotQAHardNegatives": ["test"],
"HotPotQA-PL": ["test"], # translated from hotpotQA (not trained on)
"HotpotQA-NL": ["test"], # translated from hotpotQA (not trained on)

Collaborator:

@KennethEnevoldsen if these are "not trained on", should we still keep these? Personally I find these very confusing.

Contributor:

I think translating a dataset and training on it should still lead to non-zero-shot on the benchmark; these annotations are just there to capture that. We could "link" the tasks and update the leaderboard code (but that is currently not how it is done).

Collaborator:

Got it, makes sense, thanks. The "not trained on" part was confusing for me. Maybe it could have said something closer to "trained on translation" in the future?

@nikolay-banar (Contributor, Author):

@isaac-chung Some tests failed, but that doesn't seem to be related to my code.

@isaac-chung (Collaborator):

@nikolay-banar thanks! I'm rerunning them now.
@KennethEnevoldsen just wanted to see if you're happy with the updates. I'm happy to merge once CI passes. I think we can leave the descriptive stats until v2 is merged.

@nikolay-banar (Contributor, Author):

@isaac-chung @KennethEnevoldsen @Samoed Thank you for your reviews!


@KennethEnevoldsen (Contributor):

Everything is good on my end, so I will merge this in.

@KennethEnevoldsen merged commit de8f384 into embeddings-benchmark:main, Feb 4, 2025
11 checks passed

@nikolay-banar (Contributor, Author):

@KennethEnevoldsen I have noticed a small bug in SCIDOCSNLRetrieval.py with eval_langs (it should be ["nld-Latn"]). Should I open a new issue for that?

@Samoed (Member) commented Feb 5, 2025:

You can create a PR with the fix.

@nikolay-banar deleted the beirnl-branch branch, July 2, 2025 14:33

@EwoutH commented Aug 1, 2025:

This is awesome work!

In the “Language-specific” section of the sidebar on the leaderboard, there isn’t currently a filter for Dutch. Could one be added?

@EwoutH mentioned this pull request, Aug 1, 2025

@isaac-chung (Collaborator):

BEIR-NL is currently available under miscellaneous:
[Screenshot: leaderboard sidebar showing BEIR-NL listed under the miscellaneous benchmarks]
