Conversation

Samoed
Member

@Samoed Samoed commented Aug 26, 2025

If you add a model or a dataset, please add the corresponding checklist:

makram93 and others added 30 commits July 11, 2025 22:06
* feat: unify text and image embeddings for all tasks

* fix: uniform batch size

* fix: update error message

* fix: update code task

* fix: update max length

* fix: apply review suggestions
* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* adding vidore benchmarks

* fix typo

* clean vidore names + per lang eval

* lint

* vidore names

* bibtex fix

* fix revision

* vidore v2 citation

* update citation format and fix per-language mappings

* lint: citations

* typo citations

* fix revisions

* lint

* fix colnomic3b revision

* fix colqwen2.5 revision + latest repo version

* fix query augmentation tokens

* colsmol revision
Automatically generated by python-semantic-release
* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Adding STSEvaluator and SummarizationEvaluator tests

* Correcting due to the comments

* Correcting due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Classification dataset cleaning

* Update pull request number

* Fix metadata test

* fix formatting

* add script for cleaning
Add JapaneseSentimentClassification
* change document to passage

* fix prompt names

* fix kwargs check

* fix default prompt
Automatically generated by python-semantic-release
add opensearch inf-free models

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add BareExamQA retrieval task

* ran linter

* updated details

* updated details

* fixed subtype name

* fixed changes

* ran linter again
specify revision for opensearch
Automatically generated by python-semantic-release
… been checked (#2940)

* fix: Only import SparseEncoder once the sentence-transformers version has been checked

fixes #2936

* Update mteb/models/opensearch_neural_sparse_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
…2939)

The leaderboard would hit (silent) errors where `get_benchmark` led to a KeyError because "selector_state" was being passed as the default value. Setting `DEFAULT_BENCMARK_NAME` as the value instead solves this issue.
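A minimal sketch of the failure mode and the fix (the registry dict, `DEFAULT_BENCHMARK_NAME` constant, and function names below are illustrative assumptions, not the actual mteb leaderboard internals):

```python
# Hypothetical benchmark registry; names are illustrative only.
BENCHMARKS = {"MTEB(eng)": "english benchmark", "MTEB(Multilingual)": "multilingual benchmark"}
DEFAULT_BENCHMARK_NAME = "MTEB(Multilingual)"

def get_benchmark(name: str):
    # A plain dict lookup raises KeyError for unknown names, such as a
    # UI-internal sentinel like "selector_state" leaking in as the default.
    return BENCHMARKS[name]

def get_benchmark_safe(name: str):
    # Fall back to the default benchmark name instead of erroring out.
    return BENCHMARKS.get(name, BENCHMARKS[DEFAULT_BENCHMARK_NAME])
```

With this fallback, an unknown selector value resolves to the default benchmark rather than raising.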
* docs: Update adding_a_dataset.md

* Update docs/adding_a_dataset.md
Automatically generated by python-semantic-release
* BSARD loader fixed

* BSARDv2 metadata fixed

* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Added govreport task

* Updated description
* Added BillSum datasets

* fixed billsumca

* Updated BillSumCA description

* Updated BillSumUS description

* Update mteb/tasks/Retrieval/eng/BillSumCA.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/BillSumUS.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* lint

* lint

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
…2716)

* Add RuSciBench

* fix bitext mining lang

* Add regression task

* fix init

* add missing files

* Improve description

* Add superseded_by

* fix lint

* Update regression task to match with v2

* Add stratified_subsampling for regression task

* Add bootstrap for regression task

* Rename task class, add model as evaluator argument

* fix import

* fix import 2

* fixes

* fix

* Rename regression model protocol
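The bootstrap mentioned in the commits above can be sketched roughly as follows: resample per-example predictions with replacement to get a confidence interval for a regression metric. This is a generic stdlib illustration; the function name and interval choice are assumptions, not mteb's API:

```python
import random
import statistics

def bootstrap_mae(y_true, y_pred, n_resamples=1000, seed=42):
    """Bootstrap a mean-absolute-error estimate with a ~95% interval."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        mae = sum(abs(y_true[i] - y_pred[i]) for i in idx) / n
        scores.append(mae)
    scores.sort()
    return (
        statistics.mean(scores),           # point estimate
        scores[int(0.025 * n_resamples)],  # lower bound
        scores[int(0.975 * n_resamples)],  # upper bound
    )
```

Stratified subsampling would additionally group examples (e.g. by label bucket) before resampling, so each resample preserves the label distribution.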
fzoll and others added 20 commits August 21, 2025 11:56
* Add FreshStackRetrieval

* Reformatting, correcting the revision

* Dataset correction
* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting
* Add ChatDoctorRetrieval

* Reformatting, correcting the revision

* Correct the dataset citation

* Correcting due to comments
* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Add DS1000Retrieval task implementation
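Illustratively, a code-retrieval task like DS1000Retrieval pairs natural-language queries with code documents via relevance judgments (qrels). The layout below is a generic sketch of the common retrieval format, not the exact DS1000 schema:

```python
# Generic retrieval-task layout (illustrative, not the exact DS1000 schema).
queries = {"q1": "How do I drop rows with NaN values from a pandas DataFrame?"}
corpus = {
    "d1": "import pandas as pd\ndf = df.dropna()",
    "d2": "import numpy as np\na = np.zeros((3, 3))",
}
qrels = {"q1": {"d1": 1}}  # query q1's relevant document is d1

def relevant_docs(query_id):
    """Return the corpus texts judged relevant for a query."""
    return [corpus[d] for d, rel in qrels.get(query_id, {}).items() if rel > 0]
```

An evaluator then scores a model by how highly it ranks each query's relevant documents among the full corpus.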
* feat: added jinavdr benchmark

* feat: added description for jinavdr

* feat: fixed licenses and added bibtex

* feat: made jinav4 compatible with vidore benchmark

* feat: corrected query numbers

* feat: removed print

* feat: added max pixel argument for jina models

* feat: score calculation on cpu

* feat: adjust jina model for new mteb code

* feat: code cleanup

* feat: corrected bibtex

* feat: make colpali run with jinavdr

* feat: fixed comments

* feat: better reference and fixed comments

* feat: added date for tasks

* feat: fixed missing metadata and bibtex

* feat: added descriptions per dataset
* add codiemb-minicpm

* replace codiemb_minicpm with codi_model

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update code

* update code

* reformat

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: ensure that there are always relevant docs attached to query

Here is a brief test showing that it doesn't influence the scores:
```py
t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")

eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())

# before fix:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
before_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# with update:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
with_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# check
with_fix == before_fix  # True
```

* restructure

* format

* relax pytrec versions

* fix incorrect parsing
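The fix described above, ensuring that relevant documents are always attached to their query, can be sketched like this. It is a hedged illustration, not the actual mteb code; the idea is that when a corpus is subsampled, every document the qrels mark relevant for a kept query is force-included, so no query is left with zero relevant documents:

```python
def attach_relevant_docs(corpus_sample, corpus, qrels):
    """Ensure every query's relevant documents are present in the sample.

    corpus_sample: dict of doc_id -> text (possibly missing relevant docs)
    corpus:        dict of doc_id -> text (the full corpus)
    qrels:         dict of query_id -> {doc_id: relevance}
    """
    fixed = dict(corpus_sample)
    for query_id, judgments in qrels.items():
        for doc_id, relevance in judgments.items():
            if relevance > 0 and doc_id not in fixed:
                fixed[doc_id] = corpus[doc_id]  # re-attach the missing relevant doc
    return fixed
```

Since the relevant documents were already scored whenever present, re-attaching them changes nothing for tasks where they were never dropped, which is consistent with the identical before/after scores shown above.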
Automatically generated by python-semantic-release
* add stale workflow

* add permissions

* add bug label to bug issue template

* revert bug issue and only look at more info needed issues

* more accurate name

* override default
Automatically generated by python-semantic-release
Automatically generated by python-semantic-release
# Conflicts:
#	mteb/abstasks/TaskMetadata.py
#	mteb/abstasks/__init__.py
#	mteb/models/overview.py
#	pyproject.toml
@Samoed Samoed requested a review from isaac-chung August 26, 2025 07:24
@isaac-chung
Collaborator

Oh?

FAILED tests/test_tasks/test_maeb_datasets.py::test_benchmark_audio_encoder[model0-task5] - AttributeError: passage

@isaac-chung
Collaborator

Sweet. Looks like the only missing thing is linting. Otherwise good to go.

@Samoed
Member Author

Samoed commented Aug 26, 2025

Done, missed that

@isaac-chung isaac-chung merged commit 4b992c9 into maeb Aug 26, 2025
9 checks passed
@isaac-chung isaac-chung deleted the maeb_main_merge_26_08 branch August 26, 2025 18:59