
Conversation

q275343119
Contributor

@q275343119 q275343119 commented Aug 18, 2025

Closes #3009
Combine Plots and Tables into a single tab.

The order of the sections (sketched below):

  • Summary
  • Performance by Model size
  • Performance by Task Type
  • Performance per Task
  • Task information
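
For illustration, here is a minimal Gradio sketch of what the merged layout could look like, with the sections above rendered as headed blocks inside one tab. The data, the `make_size_plot` helper, and the component choices are placeholders, not the actual leaderboard code.

```py
# Rough sketch of the merged layout; data and helpers are placeholders,
# not the real mteb leaderboard implementation.
import gradio as gr
import pandas as pd
import plotly.express as px

summary_df = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "Parameters (M)": [110, 350],
    "Mean (Task)": [62.1, 60.4],
})

def make_size_plot(df: pd.DataFrame):
    # Placeholder "performance by model size" scatter plot.
    return px.scatter(df, x="Parameters (M)", y="Mean (Task)", text="Model")

with gr.Blocks() as demo:
    with gr.Tab("Summary"):  # a single tab holding both the tables and the plots
        gr.Markdown("## Summary")
        gr.Dataframe(summary_df)
        gr.Markdown("## Performance by Model size")
        gr.Plot(make_size_plot(summary_df))
        gr.Markdown("## Performance by Task Type")
        gr.Plot(px.bar(summary_df, x="Model", y="Mean (Task)"))
        gr.Markdown("## Performance per Task")
        gr.Dataframe(summary_df)
        gr.Markdown("## Task information")
        gr.Dataframe(pd.DataFrame({"Task": ["Example"], "Type": ["Retrieval"]}))

if __name__ == "__main__":
    demo.launch()
```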

@Samoed
Member

Samoed commented Aug 18, 2025

Do you have a version of your changes hosted in Spaces?

@q275343119
Contributor Author

I will host a Space and share the URL with you later.

@q275343119
Contributor Author

Do you have a version of your changes hosted in Spaces?

Here: https://huggingface.co/spaces/q275343119/leaderboard

@Samoed
Member

Samoed commented Aug 18, 2025

This plot looks a bit strange. Can you fix it?
[screenshot]

@q275343119
Contributor Author

This plot looks a bit strange. Can you fix it? [screenshot]

I see. The size doesn’t seem to be fixed — sometimes it fills the whole tab, and sometimes it doesn’t. I’ll try to look into this and fix it.

@KennethEnevoldsen
Contributor

Let us also remove "(radar chart)" and resize the plot to make it more readable. Also, could you include a comment stating that it only shows the top 5 models in the table?

We also discussed merging cite and share into one (otherwise we will have too many tabs)

It still looks weird with the three bars on top of each other (is there any chance they could be side by side?)

@q275343119
Contributor Author

Let us also remove "(radar chart)" and resize the plot to make it more readable. Also, could you include a comment stating that it only shows the top 5 models in the table?

We also discussed merging cite and share into one (otherwise we will have too many tabs)

It still looks weird with the three bars on top of each other (is there any chance they could be side by side?)

OK. In this PR I plan to address the following:

  • Remove the (radar chart)
  • Resize the plot to make it more readable
  • (For discussion) 'Add a comment stating that it only shows the Top 5 models in the table.' I propose changing it to: “We only display the Top 5 models that have been run on all tasks in the benchmark.” Does this sound okay?

If we want to merge “cite” and “share” into a single tab, that can be handled in another PR.
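
As a rough illustration of the top-5 restriction and the proposed wording (a sketch only; `per_task_scores` and its columns are made up, not the leaderboard's real data model):

```py
# Sketch: keep only models with results on every task, then take the top 5 by mean score.
# The DataFrame and its column names are hypothetical.
import pandas as pd

per_task_scores = pd.DataFrame({
    "model": ["a", "a", "b", "b", "c"],
    "task":  ["T1", "T2", "T1", "T2", "T1"],
    "score": [0.61, 0.58, 0.64, 0.55, 0.70],
})

n_tasks = per_task_scores["task"].nunique()
complete = per_task_scores.groupby("model").filter(lambda g: g["task"].nunique() == n_tasks)
top5 = complete.groupby("model")["score"].mean().nlargest(5).index.tolist()

caption = "We only display the Top 5 models that have been run on all tasks in the benchmark."
print(caption, top5)  # "c" is excluded because it has no result for T2
```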

@q275343119
Contributor Author

Let us also remove "(radar chart)" and resize the plot to make it more readable. Also, could you include a comment stating that it only shows the top 5 models in the table?

We also discussed merging cite and share into one (otherwise we will have too many tabs)

It still looks weird with the three bars on top of each other (is there any chance they could be side by side?)

[screenshot] Would this layout look better? @KennethEnevoldsen
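
If it helps the discussion, a minimal sketch of the side-by-side idea, assuming Gradio accordions placed in a `gr.Row` (titles and contents are placeholders):

```py
# Hypothetical sketch: the three boxes side by side instead of stacked.
import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Accordion("Customize this benchmark", open=False):
            gr.Markdown("Task selection controls go here.")
        with gr.Accordion("Cite this benchmark", open=False):
            gr.Markdown("Citation text goes here.")
        with gr.Accordion("Share this benchmark", open=False):
            gr.Markdown("Share link goes here.")

demo.launch()
```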

@KennethEnevoldsen
Contributor

Can you push the leaderboard so I can check?

Generally, I would say it looks better, but I would probably format the citation and share box a bit (e.g. adding a bit of text "to cite this work please ...")

I would also close all of them by default and move the customized benchmark to the right (but it is a bit hard to see if that will be better)
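
For reference, a small sketch of a closed-by-default cite/share box with a short intro line; the accordion title, intro text, and BibTeX entry are placeholders, not the leaderboard's actual content.

```py
# Hypothetical sketch: a formatted cite/share box, closed by default.
import gradio as gr

EXAMPLE_BIBTEX = """@article{example2024benchmark,
  title  = {An Example Benchmark},
  author = {Doe, Jane},
  year   = {2024},
}"""

with gr.Blocks() as demo:
    with gr.Accordion("Cite and share this benchmark", open=False):
        gr.Markdown("To cite this work, please use:")
        gr.Code(EXAMPLE_BIBTEX)
        gr.Markdown("To share this benchmark, copy the page URL.")

demo.launch()
```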

@q275343119
Contributor Author

Can you push the leaderboard so I can check?

Generally, I would say it looks better, but I would probably format the citation and share box a bit (e.g. adding a bit of text "to cite this work please ...")

I would also close all of them by default and move the customized benchmark to the right (but it is a bit hard to see if that will be better)

Please check the updated version here: https://huggingface.co/spaces/q275343119/leaderboard

@KennethEnevoldsen
Contributor

Looks great on my end - sorry for the slow reply (@imenelydiaker or @Samoed, do you have time to look this over as well?)

@KennethEnevoldsen
Contributor

Fixed the linting issue and merged from main. Assuming the other maintainers agree with this PR, I think it is good to merge.

@Samoed
Member

Samoed commented Aug 22, 2025

I think it's a bit strange that we have cite, share, and the search filters in the same column. They become a bit hidden for users.
[screenshot]

@KennethEnevoldsen
Contributor

@Samoed, what do you suggest?

@Samoed
Member

Samoed commented Aug 22, 2025

We can leave them as they are in the current leaderboard.
[screenshot]

@KennethEnevoldsen
Contributor

That would be three on top of each other (for me that seems a bit much). I have a slight preference for the new format (@isaac-chung or @imenelydiaker, do you have a preference?)

@Samoed
Member

Samoed commented Aug 26, 2025

Maybe not on top of each other, but they should span the full width of the table. It could also be changed as in this screenshot: #3047 (comment)

Collaborator

@isaac-chung isaac-chung left a comment


Please check the updated version here: https://huggingface.co/spaces/q275343119/leaderboard

Looks good. On my laptop, the LB now fits in a single screenshot, which is a nice bonus.

@imenelydiaker
Contributor

LGTM. I think it's good to have the three blocks all in the same corner (don't forget to remove the : after "... share this benchmark").

[screenshot]

Btw, "Click for more info" link is not working for me.

@Samoed
Member

Samoed commented Aug 26, 2025

Btw, "Click for more info" link is not working for me.

Yes, we know about these issues, and it seems that they are on the Gradio/Hugging Face side: #2955 #2869

@q275343119
Contributor Author

I rebuilt the Space, and everyone can see the latest changes (the ":" has been removed) at this URL: https://huggingface.co/spaces/q275343119/leaderboard

@KennethEnevoldsen
Contributor

Sounds like this is ready to merge. @q275343119, I took the liberty of fixing the merge conflict.

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) August 29, 2025 09:12
@KennethEnevoldsen
Contributor

Seems to be failing due to #3097 (attempted fix in #3098)

@isaac-chung isaac-chung enabled auto-merge (squash) August 29, 2025 21:49
@isaac-chung isaac-chung merged commit 9586697 into embeddings-benchmark:main Aug 29, 2025
9 checks passed
Samoed added a commit that referenced this pull request Sep 1, 2025
* model: add image support for jina embeddings v4 (#2893)

* feat: unify text and image embeddings for all tasks

* fix: uniform batch size

* fix: update error message

* fix: update code task

* fix: update max length

* fix: apply review suggestions

* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)

* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>

* Add Classification Evaluator unit test (#2838)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: update colpali engine models (#2905)

* adding vidore benchmarks

* fix typo

* clean vidore names + per lang eval

* lint

* vidore names

* bibtex fix

* fix revision

* vidore v2 citation

* update citation format and fix per-language mappings

* lint: citations

* typo citations

* fix revisiions

* lint

* fix colnomic3b revision

* fix colqwen2.5 revision + latest repo version

* fix query agmentation tokens

* colsmol revision

* 1.38.35

Automatically generated by python-semantic-release

* Evaluator tests (#2910)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Adding STSEvaluator and SummarizationEvaluator tests

* Correcting due to the comments

* Correcting due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Classification dataset cleaning (#2900)

* Classification dataset cleaning

* Update pull request number

* Fix metadata test

* fix formatting

* add script for cleaning

* Update tasks & benchmarks tables

* dataset: Add JapaneseSentimentClassification (#2913)

Add JapaneseSentimentClassification

* Update tasks & benchmarks tables

* fix: change `passage` prompt to `document`  (#2912)

* change document to passage

* fix prompt names

* fix kwargs check

* fix default prompt

* 1.38.36

Automatically generated by python-semantic-release

* model: Add OpenSearch inf-free sparse encoding models (#2903)

add opensearch inf-free models

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* dataset: add BarExamQA dataset (#2916)

* Add BareExamQA retrieval task

* ran linter

* updated details

* updated details

* fixed subtype name

* fixed changes

* ran linter again

* Use `mteb.get_model` in adding_a_dataset.md (#2922)

Update adding_a_dataset.md

* fix: specify revision for opensearch (#2919)

specify revision for opensearch

* 1.38.37

Automatically generated by python-semantic-release

* Update the link for gemini-embedding-001 (#2928)

* fix: replace with passage (#2934)

* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)

* fix: Only import SparseEncoder once sentence-transformer version have been checked

fixes #2936

* Update mteb/models/opensearch_neural_sparse_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)

The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.

* docs: Update adding_a_dataset.md (#2947)

* docs: Update adding_a_dataset.md

* Update docs/adding_a_dataset.md

* ci: bump semantic release

* 1.38.38

Automatically generated by python-semantic-release

* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)

* BSARD loader fixed

* BSARDv2 metadata fixed

* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tasks & benchmarks tables

* dataset: add GovReport dataset (#2953)

* Added govreport task

* Updated description

* dataset: add BillSum datasets (#2943)

* Added BillSum datasets

* fixed billsumca

* Updated BillSumCA description

* Updated BillSumUS description

* Update mteb/tasks/Retrieval/eng/BillSumCA.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/BillSumUS.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* lint

* lint

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)

* Add RuSciBench

* fix bitext mining lang

* Add regression task

* fix init

* add missing files

* Improve description

* Add superseded_by

* fix lint

* Update regression task to match with v2

* Add stratified_subsampling for regression task

* Add boostrap for regression task

* Rename task class, add model as evaluator argument

* fix import

* fix import 2

* fixes

* fix

* Rename regression model protocol

* Update tasks & benchmarks tables

* 1.38.39

Automatically generated by python-semantic-release

* qzhou-embedding model_meta & implementation (#2975)

* qzhou-embedding model_meta & implementation

* Update qzhou_models.py

* Update qzhou_models.py

Processing todo items(Add default instruction)

* Update qzhou_models.py

correct bge datalist

* Update qzhou_models.py

correct 'public_training_data'

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update mteb/models/qzhou_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/qzhou_models.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* format qzhou_models.py for ruff check

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* model: Add Voyage 3.5 model configuration (#3005)

Add Voyage 3.5 model configuration

- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>

* model: BAAI/bge-m3-unsupervised Model (#3007)

* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out)

* Remove the commented retromae model

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* lint: Correcting lint errors (#3004)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Correcting the lint errors

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* dataset: Added 50 Vietnamese dataset from vn-mteb (#2964)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtext citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [REMOVE] default fields metadata in Classfication tasks

* Update tasks & benchmarks tables

* model: Add Cohere embed-v4.0 model support (#3006)

* Add Cohere embed-v4.0 model support

- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add Cohere embed-v4.0 model support

Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Add OpenAI models with 512 dimension (#3008)

* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)

* Correcting due to comments

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* Standardise task names and fix citation formatting (#3026)

fixes for name formatting

* Update tasks & benchmarks tables

* fix: Add missing training sets for qzhou (#3023)

* Supplement missing training sets

* reformat code

* Reorganize the data list format

* update qzhou_model meta

* 1.38.40

Automatically generated by python-semantic-release

* model: Add samilpwc_models meta (#3028)

* model: Add samilpwc_models meta

* Fix: Remove CONST

* Fix: Reformat File

* Update: model revision

* model: Add granite-vision-embedding model  (#3029)

* Add files via upload

* Address review comments

* Address review comments

* ruff format

* Update mteb/models/granite_vision_embedding_models.py

* lint error fix

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: incorrect revision for SNLRetrieval (#3033)

The provided revisions doesn't seem to be present on:
adrlau/navjordj-SNL_summarization_copy

Replacing with latest revision

* dataset: Add HumanEvalRetrieval task (#3022)

* Add HumanEvalRetrieval dataset

* Fix TaskMetadata structure and remove descriptive_stats

- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure

* Fix dataset path and use verified metadata

- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv

* Add correct revision hash to HumanEval

- Add revision hash: ed1f48a for reproducibility

* Fix HumanEval metadata validation

- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure

* Address reviewer feedback

- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility

* Fix field names in HumanEval dataset loading

Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.

* Fix deprecated metadata_dict usage

Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.

* Fix data structure for MTEB compatibility

- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility

* Address PR feedback for HumanEval dataset

- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly

* Fix BibTeX citation formatting for HumanEvalRetrieval

- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation

* Update tasks & benchmarks tables

* 1.38.41

Automatically generated by python-semantic-release

* ci: reduce parallel runs for when checking if a dataset exists (#3035)

The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)

* ci: Updating rerun delays to prevent false positives errors

* ci: Updating rerun delays to prevent false positives errors

* model: Add GreenNode Vietnamese Embedding models (#2994)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtext citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classfication tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* model: add granite-embedding-english R2 models (#3050)

* fix: Updated revision for jina-embeddings-v4 (#3046)

* fix: jinav4 revision

Signed-off-by: admin <bo.wang@jina.ai>

* change revision instead of removing it

Signed-off-by: admin <bo.wang@jina.ai>

---------

Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>

* 1.38.42

Automatically generated by python-semantic-release

* Fix 3 VN-MTEB Pair Classification tasks (#3053)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtext citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classfication tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* [FIX] VN-MTEB 3 datasets PairClassification rename column

* dataset: Add mbpp retrieval (#3037)

* Add MBPP retrieval task

- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics

* Add MBPPRetrieval to imports

* Add descriptive statistics for MBPPRetrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* dataset: Added wikisql retrieval (#3039)

* Add WikiSQL retrieval task

- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics

* Add WikiSQLRetrieval to imports

* Add descriptive statistics for WikiSQLRetrieval

* Reformatting

* Reformatting

* Reformatting, correcting the revision

* Update tasks & benchmarks tables

* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors

try to fix CI

* fix MBPPRetrieval revision (#3055)

Update MBPPRetrieval.py

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: Add VN-MTEB benchmark and Leaderboard (#2995)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtext citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] VN-MTEB benchmark and leaderboard

* [FIX] wrong benchmark name

* [REMOVE] default fields metadata in Classfication tasks

* Update tasks & benchmarks tables

* 1.38.43

Automatically generated by python-semantic-release

* Add hc3finance retrieval (#3041)

* Add HC3Finance retrieval task

- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics

* Add HC3FinanceRetrieval to imports

* Add descriptive statistics for HC3FinanceRetrieval

* Reformatting

* Reformatting, correcting the revision

* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Add finqa retrieval (#3042)

* Add FinQA retrieval task

- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics

* Add FinQARetrieval to imports

* Add descriptive statistics for FinQARetrieval

* Reformatting

* Reformatting

* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FinanceBenchRetrieval task (#3044)

* Add FinanceBenchRetrieval

* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FreshStackRetrieval task (#3043)

* Add FreshStackRetrieval

* Reformatting, correcting the revision

* Dataset correction

* Update tasks & benchmarks tables

* dataset: Add ds1000 retrieval (#3038)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* Add ChatDoctorRetrieval (#3045)

* Add ChatDoctorRetrieval

* Reformatting, correcting the revision

* Correct the dataset citation

* Correcting due to comments

* Update tasks & benchmarks tables

* Correcting the (new) DS1000 dataset's revision (#3063)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Add DS1000Retrieval task implementation

* dataset: Add JinaVDR (#2942)

* feat: added jinavdr benchmark

* feat: added description for jinavdr

* feat: fixed licenses and added bibtex

* feat: made jinav4 compatible with vidore benchmark

* feat: corrected query numbers

* feat: removed print

* feat: added max pixel argument for jina models

* feat: score calculation on cpu

* feat: adjust jina model for new mteb code

* feat: code cleanup

* feat: corrected bibtex

* feat: make colpali run with jinavdr

* feat: fixed comments

* feat: better reference and fixed comments

* feat: added date for tasks

* feat: fixed missing metadata and bibtex

* feat: added descriptions per dataset

* Update tasks & benchmarks tables

* model: Add CoDi-Embedding-V1 (#3054)

* add codiemb-minicpm

* replace codiemb_minicpm with codi_model

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update code

* update code

* reformat

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: ensure that there are always relevant docs attached to query (#3058)

* fix: ensure that there are always relevant docs attached to query

Here is a brief test showing that it doesn't influence scores:
```py
import mteb

t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")

eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())

# before fix:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
before_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# with update:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
with_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# check
with_fix == before_fix  # True
```

* restructure

* format

* relax pytrec versions

* fix incorrect parsing

* 1.38.44

Automatically generated by python-semantic-release

* Correcting the JINA models with SentenceTransformerWrapper (#3071)

* ci: Add stale workflow (#3066)

* add stale workflow

* add permissions

* add bug label to bug issue template

* revert bug issue and only look at more info needed issues

* more accurate name

* override default

* fix: open_clip package validation (#3073)

* 1.38.45

Automatically generated by python-semantic-release

* fix: Update revision for  qzhou models (#3069)

* 1.38.46

Automatically generated by python-semantic-release

* Fix the reference link for CoDi-Embedding-V1 (#3075)

Fix reference link

* fix: Add beta version of RTEB related benchmarks (#3048)

* Add RTEB related benchmarks

* Add RTEB related benchmarks

* Correcting the task names in the RTEB benchmarks

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Adding the CURE dataset to RTEB benchmarks

* Use the right language subset

* Fix broken finance icon URL in RTEB benchmarks

Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* 1.38.47

Automatically generated by python-semantic-release

* fix: run `ruff check` on all files during ci (#3086)

* fix: run `ruff check` on all files during ci

* format

* 1.38.48

Automatically generated by python-semantic-release

* Move dev to dependency groups (#3088)

add dependency groups

* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)

* Improving validate_task_to_prompt_name logs and error messages

* linter fixes

* Adding None prompts tests

* Update test_benchmark_sentence_transformer

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: duplicate mteb multilingual variables (#3080)

* fix benchmark naming

* format

* lint

* Update tasks & benchmarks tables

* model: mdbr-leaf models (#3081)

* added MDBR leaf models

* fixed revision for mdbr-leaf-ir

* added model prompts

* updated training datasets

* fixed linting

* lotte task reference

---------

Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>

* 1.38.49

Automatically generated by python-semantic-release

* CI: Set upper limit for xdist version  (#3098)

* Commentout bibtex formatting

* Remove `-n auto`

* get back bibtex

* try limiting versions

* revert coverage

* revert coverage

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Combine Plots and Tables into a Single (#3047)

* feat - Combine Plots and Tables into a Single Tab #3009

* feat - Resize the plot to make it more readable

* feat - Remove the (radar chart)

* feat - Add a comment stating that it only shows the Top 5 models in the table.

* feat - adjust layout

* Update mteb/leaderboard/app.py

* format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* mteb importable

* format

* fix model implementations

* fix `validate_task_to_prompt_name`

* align regression task with others

* remove model overview

* remove partials

* format

* fix tests

* fix evaluators tests

* add trust remote code to bsard

* pre-commit run all files

* add all descriptive stats

* fix trust remote code test

* add `RetrievalSplitData` to reranking

---------

Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai>
Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com>
Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp>
Co-authored-by: zhichao-aws <zhichaog@amazon.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com>
Co-authored-by: Feiyang <feiyangc@google.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: Nikolay Banar <nikc20008@gmail.com>
Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: fzowl <zoltan@voyageai.com>
Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com>
Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com>
Co-authored-by: roipony <roipony@gmail.com>
Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com>
Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>
Co-authored-by: admin <bo.wang@jina.ai>
Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de>
Co-authored-by: Victor <zbwkeepgoing@126.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com>
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
isaac-chung added a commit that referenced this pull request Sep 1, 2025
* model: add image support for jina embeddings v4 (#2893)

* feat: unify text and image embeddings for all tasks

* fix: uniform batch size

* fix: update error message

* fix: update code task

* fix: update max length

* fix: apply review suggestions

* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)

* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>

* Add Classification Evaluator unit test (#2838)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: update colpali engine models (#2905)

* adding vidore benchmarks

* fix typo

* clean vidore names + per lang eval

* lint

* vidore names

* bibtex fix

* fix revision

* vidore v2 citation

* update citation format and fix per-language mappings

* lint: citations

* typo citations

* fix revisiions

* lint

* fix colnomic3b revision

* fix colqwen2.5 revision + latest repo version

* fix query agmentation tokens

* colsmol revision

* 1.38.35

Automatically generated by python-semantic-release

* Evaluator tests (#2910)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Adding STSEvaluator and SummarizationEvaluator tests

* Correcting due to the comments

* Correcting due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Classification dataset cleaning (#2900)

* Classification dataset cleaning

* Update pull request number

* Fix metadata test

* fix formatting

* add script for cleaning

* Update tasks & benchmarks tables

* dataset: Add JapaneseSentimentClassification (#2913)

Add JapaneseSentimentClassification

* Update tasks & benchmarks tables

* fix: change `passage` prompt to `document`  (#2912)

* change document to passage

* fix prompt names

* fix kwargs check

* fix default prompt

* 1.38.36

Automatically generated by python-semantic-release

* model: Add OpenSearch inf-free sparse encoding models (#2903)

add opensearch inf-free models

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* dataset: add BarExamQA dataset (#2916)

* Add BareExamQA retrieval task

* ran linter

* updated details

* updated details

* fixed subtype name

* fixed changes

* ran linter again

* Use `mteb.get_model` in adding_a_dataset.md (#2922)

Update adding_a_dataset.md

* fix: specify revision for opensearch (#2919)

specify revision for opensearch

* 1.38.37

Automatically generated by python-semantic-release

* Update the link for gemini-embedding-001 (#2928)

* fix: replace with passage (#2934)

* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)

* fix: Only import SparseEncoder once sentence-transformer version have been checked

fixes #2936

* Update mteb/models/opensearch_neural_sparse_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)

The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.

* docs: Update adding_a_dataset.md (#2947)

* docs: Update adding_a_dataset.md

* Update docs/adding_a_dataset.md

* ci: bump semantic release

* 1.38.38

Automatically generated by python-semantic-release

* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)

* BSARD loader fixed

* BSARDv2 metadata fixed

* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tasks & benchmarks tables

* dataset: add GovReport dataset (#2953)

* Added govreport task

* Updated description

* dataset: add BillSum datasets (#2943)

* Added BillSum datasets

* fixed billsumca

* Updated BillSumCA description

* Updated BillSumUS description

* Update mteb/tasks/Retrieval/eng/BillSumCA.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/BillSumUS.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* lint

* lint

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)

* Add RuSciBench

* fix bitext mining lang

* Add regression task

* fix init

* add missing files

* Improve description

* Add superseded_by

* fix lint

* Update regression task to match with v2

* Add stratified_subsampling for regression task

* Add boostrap for regression task

* Rename task class, add model as evaluator argument

* fix import

* fix import 2

* fixes

* fix

* Rename regression model protocol

* Update tasks & benchmarks tables

* 1.38.39

Automatically generated by python-semantic-release

* qzhou-embedding model_meta & implementation (#2975)

* qzhou-embedding model_meta & implementation

* Update qzhou_models.py

* Update qzhou_models.py

Processing todo items(Add default instruction)

* Update qzhou_models.py

correct bge datalist

* Update qzhou_models.py

correct 'public_training_data'

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update mteb/models/qzhou_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/qzhou_models.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* format qzhou_models.py for ruff check

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* model: Add Voyage 3.5 model configuration (#3005)

Add Voyage 3.5 model configuration

- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>

* model: BAAI/bge-m3-unsupervised Model (#3007)

* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out)

* Remove the commented retromae model

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* lint: Correcting lint errors (#3004)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Correcting the lint errors

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* dataset: Added 50 Vietnamese dataset from vn-mteb (#2964)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtext citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [REMOVE] default fields metadata in Classfication tasks

* Update tasks & benchmarks tables

* model: Add Cohere embed-v4.0 model support (#3006)

* Add Cohere embed-v4.0 model support

- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add Cohere embed-v4.0 model support

Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Add OpenAI models with 512 dimension (#3008)

* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)

* Correcting due to comments

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* Standardise task names and fix citation formatting (#3026)

fixes for name formatting

* Update tasks & benchmarks tables

* fix: Add missing training sets for qzhou (#3023)

* Supplement missing training sets

* reformat code

* Reorganize the data list format

* update qzhou_model meta

* 1.38.40

Automatically generated by python-semantic-release

* model: Add samilpwc_models meta (#3028)

* model: Add samilpwc_models meta

* Fix: Remove CONST

* Fix: Reformat File

* Update: model revision

* model: Add granite-vision-embedding model  (#3029)

* Add files via upload

* Address review comments

* Address review comments

* ruff format

* Update mteb/models/granite_vision_embedding_models.py

* lint error fix

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: incorrect revision for SNLRetrieval (#3033)

The provided revision doesn't seem to be present on:
adrlau/navjordj-SNL_summarization_copy

Replacing it with the latest revision

* dataset: Add HumanEvalRetrieval task (#3022)

* Add HumanEvalRetrieval dataset

* Fix TaskMetadata structure and remove descriptive_stats

- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure

* Fix dataset path and use verified metadata

- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv

* Add correct revision hash to HumanEval

- Add revision hash: ed1f48a for reproducibility

* Fix HumanEval metadata validation

- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure

* Address reviewer feedback

- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility

* Fix field names in HumanEval dataset loading

Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.

* Fix deprecated metadata_dict usage

Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.

* Fix data structure for MTEB compatibility

- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility (see the illustrative sketch below)
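
For reference, a minimal, illustrative sketch of the split-keyed structures an MTEB retrieval task is expected to populate (toy ids and texts, not the actual HumanEval data):

```py
# Illustrative only: split-keyed corpus/queries/qrels as MTEB retrieval tasks expect,
# with integer relevance scores so pytrec_eval accepts them.
split = "test"
corpus = {split: {"doc-0": {"title": "", "text": "def add(a, b):\n    return a + b"}}}
queries = {split: {"q-0": "Write a function that returns the sum of two numbers."}}
relevant_docs = {split: {"q-0": {"doc-0": 1}}}  # note: scores are ints, not floats
```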

* Address PR feedback for HumanEval dataset

- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly

* Fix BibTeX citation formatting for HumanEvalRetrieval

- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation

* Update tasks & benchmarks tables

* 1.38.41

Automatically generated by python-semantic-release

* ci: reduce parallel runs for when checking if a dataset exists (#3035)

The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)

* ci: Updating rerun delays to prevent false positive errors

* ci: Updating rerun delays to prevent false positive errors

* model: Add GreenNode Vietnamese Embedding models (#2994)

* [ADD] 50 Vietnamese datasets from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classification tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* model: add granite-embedding-english R2 models (#3050)

* fix: Updated revision for jina-embeddings-v4 (#3046)

* fix: jinav4 revision

Signed-off-by: admin <bo.wang@jina.ai>

* change revision instead of removing it

Signed-off-by: admin <bo.wang@jina.ai>

---------

Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>

* 1.38.42

Automatically generated by python-semantic-release

* Fix 3 VN-MTEB Pair Classification tasks (#3053)

* [ADD] 50 Vietnamese datasets from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classification tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* [FIX] Rename column in 3 VN-MTEB PairClassification datasets

* dataset: Add mbpp retrieval (#3037)

* Add MBPP retrieval task

- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics

* Add MBPPRetrieval to imports

* Add descriptive statistics for MBPPRetrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* dataset: Added wikisql retrieval (#3039)

* Add WikiSQL retrieval task

- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics

* Add WikiSQLRetrieval to imports

* Add descriptive statistics for WikiSQLRetrieval

* Reformatting

* Reformatting

* Reformatting, correcting the revision

* Update tasks & benchmarks tables

* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors

try to fix CI

* fix MBPPRetrieval revision (#3055)

Update MBPPRetrieval.py

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: Add VN-MTEB benchmark and Leaderboard (#2995)

* [ADD] 50 Vietnamese datasets from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] VN-MTEB benchmark and leaderboard

* [FIX] wrong benchmark name

* [REMOVE] default fields metadata in Classification tasks

* Update tasks & benchmarks tables

* 1.38.43

Automatically generated by python-semantic-release

* Add hc3finance retrieval (#3041)

* Add HC3Finance retrieval task

- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics

* Add HC3FinanceRetrieval to imports

* Add descriptive statistics for HC3FinanceRetrieval

* Reformatting

* Reformatting, correcting the revision

* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Add finqa retrieval (#3042)

* Add FinQA retrieval task

- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics

* Add FinQARetrieval to imports

* Add descriptive statistics for FinQARetrieval

* Reformatting

* Reformatting

* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FinanceBenchRetrieval task (#3044)

* Add FinanceBenchRetrieval

* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FreshStackRetrieval task (#3043)

* Add FreshStackRetrieval

* Reformatting, correcting the revision

* Dataset correction

* Update tasks & benchmarks tables

* dataset: Add ds1000 retrieval (#3038)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* Add ChatDoctorRetrieval (#3045)

* Add ChatDoctorRetrieval

* Reformatting, correcting the revision

* Correct the dataset citation

* Correcting due to comments

* Update tasks & benchmarks tables

* Correcting the (new) DS1000 dataset's revision (#3063)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Add DS1000Retrieval task implementation

* dataset: Add JinaVDR (#2942)

* feat: added jinavdr benchmark

* feat: added description for jinavdr

* feat: fixed licenses and added bibtex

* feat: made jinav4 compatible with vidore benchmark

* feat: corrected query numbers

* feat: removed print

* feat: added max pixel argument for jina models

* feat: score calculation on cpu

* feat: adjust jina model for new mteb code

* feat: code cleanup

* feat: corrected bibtex

* feat: make colpali run with jinavdr

* feat: fixed comments

* feat: better reference and fixed comments

* feat: added date for tasks

* feat: fixed missing metadata and bibtex

* feat: added descriptions per dataset

* Update tasks & benchmarks tables

* model: Add CoDi-Embedding-V1 (#3054)

* add codiemb-minicpm

* replace codiemb_minicpm with codi_model

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update code

* update code

* reformat

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: ensure that there are always relevant docs attached to query (#3058)

* fix: ensure that there are always relevant docs attached to query

Here is a brief test showing that the fix doesn't influence scores:
```py
import mteb

t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")

eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())

# before fix:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
before_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# with update:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
with_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# check
with_fix == before_fix  # True
```

* restructure

* format

* relax pytrec versions

* fix incorrect parsing

* 1.38.44

Automatically generated by python-semantic-release

* Correcting the JINA models with SentenceTransformerWrapper (#3071)

* ci: Add stale workflow (#3066)

* add stale workflow

* add permissions

* add bug label to bug issue template

* revert bug issue and only look at more info needed issues

* more accurate name

* override default

* fix: open_clip package validation (#3073)

* 1.38.45

Automatically generated by python-semantic-release

* fix: Update revision for  qzhou models (#3069)

* 1.38.46

Automatically generated by python-semantic-release

* Fix the reference link for CoDi-Embedding-V1 (#3075)

Fix reference link

* fix: Add beta version of RTEB related benchmarks (#3048)

* Add RTEB related benchmarks

* Add RTEB related benchmarks

* Correcting the task names in the RTEB benchmarks

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Adding the CURE dataset to RTEB benchmarks

* Use the right language subset

* Fix broken finance icon URL in RTEB benchmarks

Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* 1.38.47

Automatically generated by python-semantic-release

* fix: run `ruff check` on all files during ci (#3086)

* fix: run `ruff check` on all files during ci

* format

* 1.38.48

Automatically generated by python-semantic-release

* Move dev to dependency groups (#3088)

add dependency groups

* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)

* Improving validate_task_to_prompt_name logs and error messages

* linter fixes

* Adding None prompts tests

* Update test_benchmark_sentence_transformer

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: duplicate mteb multilingual variables (#3080)

* fix benchmark naming

* format

* lint

* Update tasks & benchmarks tables

* model: mdbr-leaf models (#3081)

* added MDBR leaf models

* fixed revision for mdbr-leaf-ir

* added model prompts

* updated training datasets

* fixed linting

* lotte task reference

---------

Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>

* 1.38.49

Automatically generated by python-semantic-release

* CI: Set upper limit for xdist version  (#3098)

* Comment out bibtex formatting

* Remove `-n auto`

* get back bibtex

* try limiting versions

* revert coverage

* revert coverage

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Combine Plots and Tables into a Single Tab (#3047)

* feat - Combine Plots and Tables into a Single Tab #3009

* feat - Resize the plot to make it more readable

* feat - Remove the (radar chart)

* feat - Add a comment stating that it only shows the Top 5 models in the table.

* feat - adjust layout

* Update mteb/leaderboard/app.py

* format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix: Updating the default batch size calculation in the voyage models (#3091)

* 1.38.50

Automatically generated by python-semantic-release

* fix: Add @classmethod for @field_validators in TaskMetadata  (#3100)
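
For context, the pydantic pattern this refers to stacks the field validator on top of `@classmethod`. A minimal sketch with a stand-in model, assuming pydantic v2 (not mteb's actual TaskMetadata):

```py
from pydantic import BaseModel, field_validator


class ExampleMetadata(BaseModel):
    name: str

    @field_validator("name")
    @classmethod  # field validators receive the class, so they should be declared as classmethods
    def name_must_not_be_blank(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("name must not be blank")
        return value


ExampleMetadata(name="HumanEvalRetrieval")  # blank names would raise a ValidationError
```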

* Align task prompt dict with `PromptType` (#3101)

* align task prompt dict with `PromptType`

* use value instead of enum (see the sketch below)
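
A minimal sketch of the "use value instead of enum" change, with a stand-in enum rather than mteb's actual PromptType:

```py
# Illustrative: key the prompt mapping by the enum's string value, not the member itself.
from enum import Enum


class PromptType(str, Enum):
    query = "query"
    document = "document"


task_prompts = {
    PromptType.query.value: "Represent this sentence for retrieving relevant passages:",
    PromptType.document.value: "Represent this passage:",
}
assert "query" in task_prompts  # plain string keys, usable without importing the enum
```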

* 1.38.51

Automatically generated by python-semantic-release

* model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090)

* Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1

* Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth)

* Format with ruff + add loader per review

* Apply ruff format/fixes

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader)

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix import

* Add memory_usage_mb=808.0 and required fields to ModelMeta

* Fix parameter count (210 million)

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: Allow closed datasets (#3059)

* - Added an include_private parameter to the get_tasks() function that defaults to False
  - This ensures that by default, tests only run on public datasets
  - Tests can explicitly set include_private=True when needed to test private datasets

  - Added is_public: bool | None = None field to TaskMetadata
  - The field is optional and defaults to None (treated as public)
  - Updated the is_filled() method to exclude is_public from required fields
  - Added documentation

* - Added an include_private parameter to the get_tasks() function that defaults to False
  - This ensures that by default, tests only run on public datasets
  - Tests can explicitly set include_private=True when needed to test private datasets

  - Added is_public: bool | None = None field to TaskMetadata
  - The field is optional and defaults to None (treated as public)
  - Updated the is_filled() method to exclude is_public from required fields
  - Added documentation (see the usage sketch below)
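
A hedged usage sketch of the filtering behaviour described above; the names are illustrative, not mteb's actual implementation:

```py
# Illustrative sketch of excluding private tasks by default; not the actual mteb code.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    is_public: bool | None = None  # None is treated as public


def filter_tasks(tasks: list[Task], exclude_private: bool = True) -> list[Task]:
    if not exclude_private:
        return tasks
    return [t for t in tasks if t.is_public is not False]


tasks = [Task("PublicTask"), Task("ClosedTask", is_public=False)]
assert [t.name for t in filter_tasks(tasks)] == ["PublicTask"]
```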

* Correcting due to comments

* Update mteb/abstasks/TaskMetadata.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/overview.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Removing the not used filter_tasks_by_privacy function

* Correcting due to comments

* Correcting due to comments

* Correcting due to comments

* Removing the test case

* Rename the include_private parameter to exclude_private

* Rename the include_private parameter to exclude_private

* Add private tasks tests

* Add private tasks tests

* Update tests/test_tasks/test_private_tasks.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Add private tasks tests

* Add private tasks tests

* Add private tasks tests

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* 1.38.52

Automatically generated by python-semantic-release

---------

Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai>
Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com>
Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: zhichao-aws <zhichaog@amazon.com>
Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com>
Co-authored-by: Feiyang <feiyangc@google.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: Nikolay Banar <nikc20008@gmail.com>
Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: fzowl <zoltan@voyageai.com>
Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com>
Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com>
Co-authored-by: roipony <roipony@gmail.com>
Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com>
Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>
Co-authored-by: admin <bo.wang@jina.ai>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de>
Co-authored-by: Victor <zbwkeepgoing@126.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com>
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: mathlesage <134429083+mathlesage@users.noreply.github.com>