fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression #2716


Merged

Conversation

AlexeyVatolin
Contributor

@AlexeyVatolin AlexeyVatolin commented May 22, 2025

Add RuSciBench datasets with scientific tasks in Russian and English, sourced from the Russian scientific electronic library elibrary.ru.

Here is our paper:
https://link.springer.com/article/10.1134/S1064562424602191

Checklist

  • I did not add a dataset, or if I did, I added the dataset checklist to the PR and completed it.

  • I did not add a model, or if I did, I added the model checklist to the PR and completed it.

  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.

    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).

  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)

| model_name | task_name | languages | main_score |
|------------|-----------|-----------|------------|
| multilingual-e5-small | RuSciBenchBitexMining | eng-Latn,rus-Cyrl | 0.978372 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchBitexMining | eng-Latn,rus-Cyrl | 0.945861 |
| multilingual-e5-small | RuSciBenchBitexMining | rus-Cyrl,eng-Latn | 0.974774 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchBitexMining | rus-Cyrl,eng-Latn | 0.929254 |
| multilingual-e5-small | RuSciBenchCiteRetrieval | eng-Latn | 0.25836 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCiteRetrieval | eng-Latn | 0.23692 |
| multilingual-e5-small | RuSciBenchCiteRetrieval | rus-Cyrl | 0.28923 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCiteRetrieval | rus-Cyrl | 0.18175 |
| multilingual-e5-small | RuSciBenchCociteRetrieval | eng-Latn | 0.21956 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCociteRetrieval | eng-Latn | 0.2035 |
| multilingual-e5-small | RuSciBenchCociteRetrieval | rus-Cyrl | 0.24766 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCociteRetrieval | rus-Cyrl | 0.15751 |
| multilingual-e5-small | RuSciBenchCoreRiscClassification | eng-Latn | 0.594057 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCoreRiscClassification | eng-Latn | 0.578581 |
| multilingual-e5-small | RuSciBenchCoreRiscClassification | rus-Cyrl | 0.594652 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchCoreRiscClassification | rus-Cyrl | 0.580301 |
| multilingual-e5-small | RuSciBenchPubTypeClassification | eng-Latn | 0.345671 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchPubTypeClassification | eng-Latn | 0.317749 |
| multilingual-e5-small | RuSciBenchPubTypeClassification | rus-Cyrl | 0.361472 |
| paraphrase-multilingual-MiniLM-L12-v2 | RuSciBenchPubTypeClassification | rus-Cyrl | 0.321645 |

@AlexeyVatolin AlexeyVatolin marked this pull request as draft May 22, 2025 22:41
@AlexeyVatolin AlexeyVatolin marked this pull request as ready for review May 22, 2025 22:51
Member

@Samoed Samoed left a comment


Congratulations on the publication of your paper! Can you also add your benchmark to benchmarks.py?

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Metadata generally looks good, though the descriptions could use a slight improvement.

@isaac-chung
Collaborator

@AlexeyVatolin would love to get this in, if you're still working on this!

@KennethEnevoldsen
Contributor

It seems like this has gotten stale - it's close enough that we could finish it. @Samoed I suppose we could solve the load_data issue simply using v2 and then we are basically there

@AlexeyVatolin
Contributor Author

AlexeyVatolin commented Jul 13, 2025

As @Samoed mentioned, I have added RuSciBench to the list of benchmarks. There is an issue with the GRNTI and OECD classification tasks: they were previously added as part of the RuMTEB benchmark, but only in Russian. To avoid name conflicts, I added "Orig" to the names (RuSciBenchGRNTIOrigClassification, RuSciBenchOECDOrigClassification). I have checked and found that the data is sampled slightly differently, which is why the metric values for the tasks do not match in Russian.

@isaac-chung
Collaborator

Thanks. I think in general this looks good. I'd like to get @KennethEnevoldsen's and @Samoed's opinions on the added regression abstask before moving forward.

I added "Orig" to the names (RuSciBenchGRNTIOrigClassification, RuSciBenchOECDOrigClassification)

Let's add superseded_by to the non-orig versions of the tasks as well? e.g.

  • add superseded_by="RuSciBenchOECDOrigClassification" to RuSciBenchOECDClassification

@AlexeyVatolin AlexeyVatolin requested a review from Samoed July 16, 2025 19:56
@AlexeyVatolin AlexeyVatolin requested a review from Samoed July 17, 2025 18:07
@Samoed Samoed requested a review from KennethEnevoldsen July 17, 2025 21:17
@Samoed
Member

Samoed commented Jul 17, 2025

Great work!

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Focused mostly on the regression tasks. Generally everything looks good, but I had a few minor changes to add.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


A few more changes, but otherwise I think we are good to merge

Comment on lines 17 to 19
class RegressorModel(Protocol):
    def fit(self, X, y, sample_weight=None): ...
    def predict(self, X): ...
Contributor


Better to use the RegressorMixin, but that might be a bit harder for the user, so I would import it as:

from sklearn.base import RegressorMixin as SklearnRegressorModel

Contributor Author


The RegressorMixin class has only the score method, which is not used in my code. If we use it, the LinearRegressionEvaluator class will encounter the following error: Cannot access attribute "fit" for class "RegressorMixin".

Contributor


Ahh, that is annoying..., but I see the reason for using this approach then. Let's rename it to SklearnRegressorModel (just to clarify that it is a Sklearn-compatible model).
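Under the agreed rename, the protocol approach can be sketched as below. This is a minimal illustration, not mteb's actual code: the `MeanRegressor` toy model stands in for any sklearn-compatible regressor, and `runtime_checkable` is added here only so the structural check can be demonstrated at runtime.

```python
from typing import Protocol, runtime_checkable


# Structural type for sklearn-style regressors. A Protocol declares the
# fit/predict methods that static checkers need, which the bare sklearn
# RegressorMixin (which only provides .score) would not.
@runtime_checkable
class SklearnRegressorModel(Protocol):
    def fit(self, X, y, sample_weight=None): ...
    def predict(self, X): ...


# Any object with matching methods satisfies the protocol structurally,
# e.g. this toy regressor that always predicts the training-target mean.
class MeanRegressor:
    def fit(self, X, y, sample_weight=None):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[0], [1]], [1.0, 3.0])
print(isinstance(model, SklearnRegressorModel))
print(model.predict([[2]]))
```

Because the check is structural, real sklearn estimators such as LinearRegression or Ridge also satisfy the protocol without inheriting from it.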

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Once the issue with the regressor typing is fixed then this is good to merge

@AlexeyVatolin
Contributor Author

@KennethEnevoldsen, could you please take a look at the pull request when you have a moment? Thank you!

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Looks great! Sorry for being slow to respond, I was at a conference (ACL) last week.

@KennethEnevoldsen KennethEnevoldsen changed the title Add RuSciBench fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression Aug 2, 2025
@KennethEnevoldsen KennethEnevoldsen merged commit 36df9ca into embeddings-benchmark:main Aug 2, 2025
9 checks passed