Merge main v2 07 10 #2895

Samoed · 2025-07-10T14:19:39Z

Merged main
Fixed implementations for Seed1.6, nvidia-llama
Added prompts_dict to AbsEncoder

* Update issue templates
* Update bug_report.md
* test yaml template
* add templates
* update templates
* add emojis
* fix typo
* Apply suggestions from code review
  Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* update issue titles
* update PR template
* remove PR templates

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* add model: geogpt_models
* update geogpt_models
* use InstructSentenceTransformerWrapper
* resolve pylint warning
* format geogpt_models.py
* Update mteb/models/geogpt_models.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/geogpt_models.py

---------

Co-authored-by: zhangzeqing <zhangzeqing@zhejianglab.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* add xyz model * add xyz model * add xyz model * update * update * update * update * update * update * update * lint --------- Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

discussed in: #2796

* Add files via upload
* Add files via upload
* Update benchmarks.py
* Update __init__.py
* Add files via upload
* Update R2MEDRetrieval.py
* Update run_mteb_r2med.py
* Delete scripts/run_mteb_r2med.py
* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Add files via upload
* Delete mteb/descriptive_stats/Retrieval/R2MEDRetrieval.json
* Add files via upload
* Add files via upload
* Add files via upload
* Update R2MEDRetrieval.py
* Add files via upload
* Add files via upload
* Add files via upload
* Add files via upload
* format citations
* Update R2MEDRetrieval.py
* Add files via upload
* Add files via upload

---------

Co-authored-by: Li Lei <34205771+ll0ruc@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

update training datasets Co-authored-by: zhangzeqing <zhangzeqing@zhejianglab.com>

* fix: Add adapted_from to Cmedqaretrieval Also snuck in a fix with form=None, which is no longer valid, but was still used in a few places. * format

Automatically generated by python-semantic-release

* Adding OpenAI client arg to init method (e.g., for already initialized AzureOpenAI client)
  To use OpenAI embedding models via Azure, the model wrapper needs to be initialized with a different client.
* Update mteb/models/openai_models.py
  Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/openai_models.py
* remove comment and format

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
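The client-injection idea in this commit can be illustrated with a minimal sketch. It is not the code from `mteb/models/openai_models.py`: `EmbeddingWrapper`, `FakeClient`, and `FakeEmbeddings` are hypothetical names, and the fake client only mimics the `embeddings.create(input=..., model=...)` call shape of the OpenAI Python client so the example runs without network access or credentials.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


class FakeEmbeddings:
    """Hypothetical stand-in that mimics the client's `embeddings.create` call."""

    def create(self, input: List[str], model: str) -> SimpleNamespace:
        # One dummy 1-dimensional "embedding" per text: its length.
        data = [SimpleNamespace(embedding=[float(len(text))]) for text in input]
        return SimpleNamespace(data=data)


class FakeClient:
    """Hypothetical stand-in for an already initialized client (e.g. AzureOpenAI)."""

    def __init__(self) -> None:
        self.embeddings = FakeEmbeddings()


class EmbeddingWrapper:
    """Hypothetical wrapper; the real mteb class may look different."""

    def __init__(self, model_name: str, client: Optional[Any] = None) -> None:
        self.model_name = model_name
        # Keep the caller's client if one was passed in.
        self._client = client

    def _get_client(self) -> Any:
        if self._client is None:
            # Assumed default: build a plain OpenAI client lazily.
            from openai import OpenAI

            self._client = OpenAI()
        return self._client

    def encode(self, texts: List[str]) -> List[List[float]]:
        response = self._get_client().embeddings.create(
            input=texts, model=self.model_name
        )
        return [item.embedding for item in response.data]


wrapper = EmbeddingWrapper("text-embedding-3-small", client=FakeClient())
vectors = wrapper.encode(["ab", "abc"])
```

With a real deployment, the same constructor argument would receive an `AzureOpenAI` instance configured by the caller, so the wrapper never needs to know about Azure-specific settings.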

Add LGAI-Embedding - Add mteb/models/lgai_embedding_models.py - defined model metadata

fixes #2811

Automatically generated by python-semantic-release

* add description to template * fix typo

* Added HIT-TMG_KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper
* Added KaLM_embedding_multilingual_mini_instruct_v1_5
* Added model to overview.py
* Fix Task Count Per Language Table in tasks.md
* resolve conflicts
* remove tasks.md
* Modified get_instruction function
* Added support for prompt dict in get_instruction
* fix lang code
* Address comments
* Delete mteb/models/check_models.py
* added prompts_dict support in InstructSentenceTransformerWrapper
* corrected instruction format
* corrected prompts format
* added correct instruction format
* fix implementation
* remove `if name main`
* add comment

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: Reuploaded previously unavailable SNL datasets closes #2477 * removed exceptions from tests * temp fixes * added temporary fix * clean up commented out code * format

Automatically generated by python-semantic-release

* Update usage.md * Update usage.md * Update docs/usage/usage.md --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* add custom instructions * fixed * lint * fix last instruction --------- Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* add Seed-1.6-embedding model * Update seed_1_6_embedding_models.py * update model meta info * support image encoder interface * error fix * fix: format seed_1_6_embedding_models.py with Ruff

* fix: Update model selection for the leaderboard fixes #2834 This removed the lower bound selection, but generally I don't think people should care about the models being too small. * fix 1M --> 1B * format * rename model_size -> max_model_size

Automatically generated by python-semantic-release

update seed1.6 model training data info

Automatically generated by python-semantic-release

* add model meta * linting * fix: add check for code lora * fix: apply review comments

* fix prompt validation * fix task name split correctly * add docstring for test

Automatically generated by python-semantic-release

* Adding Hinvec model's metadata.
* Adding hinvec_model.py
* Update mteb/models/hinvec_models.py
  Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* formatted code with Black and linted with Ruff

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

Bump gradio

* nvidia_llama_nemoretriever_colembed * correct 3b reference * lint fix * add training data and license for nvidia/llama_nemoretriever_colembed * lint --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix sbert `v5` * add comment

* add listconranker modelmeta * fix bugs * use linter * lint --------- Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com>

comment kalm model

* Add JaCWIR and JQaRA for reranking * Fix ANLP Journal datasets * Add NLPJournalAbsArticleRetrieval and JaCWIRRetrieval * tackle test cases * Remove _evaluate_subset usage * Separate v1 and v2 * Update info for NLP Journal datasets

* add tooka v2s * add mcinext models * update mcinext.py * Apply PR review suggestions * Update mteb/models/mcinext_models.py --------- Co-authored-by: mehran <mehan.sarmadi16@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Added DadoEvalCoarseClassification
* Removed unnecessary columns from DadoEvalCoarseClassification
* Added EmitClassification task
* added SardiStanceClassification task
* Added GeoLingItClassification task
* Added DisCoTexPairClassification tasks
* Added EmitClassification, DadoEvalCoarseClassification, GeoLingItClassification, SardiStanceClassification inside the inits
* changed import in DisCoTexPairClassification
* removed GeoLingItClassification dataset
* fixed citation formatting, missing metadata parameters and lint formatting
* - Added XGlueWRPReranking task
  - Added missing __init__.py files
* fixed metadata in XGlueWRPReranking
* Added MKQARetrieval task
* fixed type in XGlueWRPReranking
* changed MKQARetrieval from cross-lingual to monolingual
* formatted MKQARetrieval file
* removed unused const

---------

Co-authored-by: Mattia Sangermano <MattiaSangermano@users.noreply.huggingface.co>

fix datasets version

Automatically generated by python-semantic-release

# Conflicts:
# docs/create_tasks_table.py
# docs/tasks.md
# docs/usage/usage.md
# mteb/evaluation/evaluators/RetrievalEvaluator.py
# mteb/models/instruct_wrapper.py
# mteb/models/model_implementations/jina_models.py
# mteb/models/model_implementations/misc_models.py
# mteb/models/model_implementations/openai_models.py
# mteb/models/model_implementations/ru_sentence_models.py
# mteb/models/overview.py
# mteb/models/wrapper.py
# mteb/tasks/Classification/__init__.py
# mteb/tasks/Clustering/nob/snl_clustering.py
# mteb/tasks/MultiLabelClassification/__init__.py
# mteb/tasks/PairClassification/__init__.py
# mteb/tasks/Reranking/__init__.py
# mteb/tasks/Retrieval/__init__.py
# mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
# pyproject.toml
# tests/test_benchmark/mock_models.py
# tests/test_benchmark/test_benchmark.py

Samoed and others added 30 commits June 8, 2025 22:30

bump ruff (#2784)

9e2e972

ci: fix config error for semantic release (#2800)

3d8dd9e

discussed in: #2796

Update tasks & benchmarks tables

5e6aa9d

Update training datasets of GeoGPT-Research-Project/GeoEmbedding (#2802)

36a3c67

update training datasets Co-authored-by: zhangzeqing <zhangzeqing@zhejianglab.com>

fix: Add adapted_from to Cmedqaretrieval (#2806)

fef1837

* fix: Add adapted_from to Cmedqaretrieval Also snuck in a fix with form=None, which is no longer valid, but was still used in a few places. * format

1.38.28

e6238f2

Automatically generated by python-semantic-release

model: Add annamodels/LGAI-Embedding-Preview (#2810)

3e291f3

Add LGAI-Embedding - Add mteb/models/lgai_embedding_models.py - defined model metadata

fix: Ensure bright uses the correct revision (#2812)

56dc620

fixes #2811

1.38.29

9fc0c3d

Automatically generated by python-semantic-release

add description to issue template (#2817)

04c9511

* add description to template * fix typo

fix: Reuploaded previously unavailable SNL datasets (#2819)

c790269

* fix: Reuploaded previously unavailable SNL datasets closes #2477 * removed exceptions from tests * temp fixes * added temporary fix * clean up commented out code * format

Update tasks & benchmarks tables

74d17b2

1.38.30

dcdc16a

Automatically generated by python-semantic-release

docs: Fix some typos in docs/usage/usage.md (#2835)

774a942

* Update usage.md * Update usage.md * Update docs/usage/usage.md --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

model: Add custom instructions for GigaEmbeddings (#2836)

d7ff1ab

* add custom instructions * fixed * lint * fix last instruction --------- Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

model: add Seed-1.6-embedding model (#2841)

8851bf0

* add Seed-1.6-embedding model * Update seed_1_6_embedding_models.py * update model meta info * support image encoder interface * error fix * fix: format seed_1_6_embedding_models.py with Ruff

1.38.31

642898f

Automatically generated by python-semantic-release

fix: update training dataset info of Seed-1.6-embedding model (#2857)

a8214e2

update seed1.6 model training data info

1.38.32

82844eb

Automatically generated by python-semantic-release

add jinav4 model meta (#2858)

f1d560a

* add model meta * linting * fix: add check for code lora * fix: apply review comments

fix: prompt validation for tasks with - (#2846)

430357c

* fix prompt validation * fix task name split correctly * add docstring for test
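Why a hyphen in a task name can trip up prompt validation is easiest to see with a small sketch. It assumes prompt keys of the form `<task name>-<prompt type>` with the prompt types `query` and `document`; that key layout, the helper name `split_prompt_key`, and the task name `My-Task` are illustrative assumptions, not the code changed in this commit.

```python
from typing import Optional, Tuple

# Assumed set of prompt-type suffixes (an assumption for this sketch).
PROMPT_TYPES = ("query", "document")


def split_prompt_key(key: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a prompt key into (task_name, prompt_type).

    Splitting on the first "-" would cut a hyphenated task name in half,
    so only the part after the last "-" is checked against the known types.
    """
    if key in PROMPT_TYPES:
        # A bare prompt type applies to every task.
        return None, key
    name, sep, suffix = key.rpartition("-")
    if sep and suffix in PROMPT_TYPES:
        return name, suffix
    # No recognised suffix: the whole key is treated as the task name.
    return key, None


naive = "My-Task-query".split("-")[0]          # cuts the task name short
fixed = split_prompt_key("My-Task-query")      # keeps the task name whole
```

Here `naive` ends up as `"My"`, while `fixed` keeps the full hyphenated name alongside the prompt type, which is the behaviour a validator needs before it looks the task name up.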

1.38.33

9fed3e5

Automatically generated by python-semantic-release

Samoed and others added 16 commits June 28, 2025 11:25

Bump gradio to fix leaderboard sorting (#2866)

a4388c2

Bump gradio

model: Adding nvidia/llama-nemoretriever-colembed models (#2861)

4ff1413

* nvidia_llama_nemoretriever_colembed * correct 3b reference * lint fix * add training data and license for nvidia/llama_nemoretriever_colembed * lint --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

rename seed-1.6-embedding to seed1.6-embedding (#2870)

f27648b

fix tests to be compatible with SentenceTransformers v5 (#2875)

f346a37

* fix sbert `v5` * add comment

model: add listconranker modelmeta (#2874)

5846f56

* add listconranker modelmeta * fix bugs * use linter * lint --------- Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

model: add kalm_models ModelMeta (new PR) (#2853)

b67bd04

* feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com>

Comment kalm model (#2877)

a3ca95c

comment kalm model

Update tasks & benchmarks tables

5be02c1

Update tasks & benchmarks tables

5303fec

fix: pin datasets version (#2892)

00c95cf

fix datasets version

1.38.34

cfa27d7

Automatically generated by python-semantic-release

fix model implementations

0b6fcae

Samoed requested review from KennethEnevoldsen and isaac-chung July 10, 2025 14:19

Samoed added 2 commits July 10, 2025 17:35

fix tasks

141fca0

add metrics

8285279

isaac-chung approved these changes Jul 10, 2025

Samoed merged commit a23e2eb into v2.0.0 Jul 10, 2025
9 checks passed

Samoed deleted the merge_main_v2_07_10 branch July 10, 2025 17:25
