Skip to content

Merge main v2 07 10 #2895

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
Jul 10, 2025
Merged

Merge main v2 07 10 #2895

merged 48 commits into from
Jul 10, 2025

Conversation

Samoed
Copy link
Member

@Samoed Samoed commented Jul 10, 2025

  • Merged main
  • Fixed implementations for Seed1.6, nvidia-llama
  • Added prompts_dict to AbsEncoder

Samoed and others added 30 commits June 8, 2025 22:30
* Update issue templates

* Update bug_report.md

* test yaml template

* add templates

* update templates

* add emojis

* fix typo

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* update issue titles

* update PR template

* remove PR templates

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* add model: geogpt_models

* update geogpt_models

* use InstructSentenceTransformerWrapper

* resolve pylint warning

* format geogpt_models.py

* Update mteb/models/geogpt_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/geogpt_models.py

---------

Co-authored-by: zhangzeqing <zhangzeqing@zhejianglab.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* add xyz model

* add xyz model

* add xyz model

* update

* update

* update

* update

* update

* update

* update

* lint

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Add files via upload

* Add files via upload

* Update benchmarks.py

* Update __init__.py

* Add files via upload

* Update R2MEDRetrieval.py

* Update run_mteb_r2med.py

* Delete scripts/run_mteb_r2med.py

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Add files via upload

* Delete mteb/descriptive_stats/Retrieval/R2MEDRetrieval.json

* Add files via upload

* Add files via upload

* Add files via upload

* Update R2MEDRetrieval.py

* Add files via upload

* Add files via upload

* Add files via upload

* Add files via upload

* format citations

* Update R2MEDRetrieval.py

* Add files via upload

* Add files via upload

---------

Co-authored-by: Li Lei <34205771+ll0ruc@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
update training datasets

Co-authored-by: zhangzeqing <zhangzeqing@zhejianglab.com>
* fix: Add adapted_from to Cmedqaretrieval

Also snuck in a fix with form=None, which is no longer valid, but was still used in a few places.

* format
Automatically generated by python-semantic-release
* Adding OpenAI client arg to init method (e.g., for already initialized AzureOpenAI client)

To use OpenAI embedding models via Azure, the model wrapper needs to be initialized with a different client.

* Update mteb/models/openai_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/openai_models.py

* remove comment and format

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Add LGAI-Embedding

- Add mteb/models/lgai_embedding_models.py

- defined model metadata
Automatically generated by python-semantic-release
* add description to template

* fix typo
* Added HIT-TMG_KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper

* Added KaLM_embedding_multilingual_mini_instruct_v1_5

* Added model to overview.py

* Fix Task Count Per Language Table in tasks.md

* resolve conflicts

* remove tasks.md

* Modified get_instruction funcion

* Added support for prompt dict in get_instruction

* fix lang code

* Address comments

* Delete mteb/models/check_models.py

* added prompts_dict support in InstructSentenceTransformerWrapper

* corrected instruction format

* corrected prompts format

* added correct instruction format

* fix implementation

* remove `if name main`

* add comment

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* fix: Reuploaded previously unavailable SNL datasets

closes #2477

* removed exceptions from tests

* temp fixes

* added temporary fix

* clean up commented out code

* format
Automatically generated by python-semantic-release
* Update usage.md

* Update usage.md

* Update docs/usage/usage.md

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* add custom instructions

* fixed

* lint

* fix last instruction

---------

Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* add Seed-1.6-embedding model

* Update seed_1_6_embedding_models.py

* update model meta info

* support image encoder interface

* error fix

* fix: format seed_1_6_embedding_models.py with Ruff
* fix: Update model selection for the leaderboard

fixes #2834

This removed the lower bound selection, but generally I don't think people should care about the models being too small.

* fix 1M --> 1B

* format

* rename model_size -> max_model_size
Automatically generated by python-semantic-release
Automatically generated by python-semantic-release
* add model meta

* linting

* fix: add check for code lora

* fix: apply review comments
* fix prompt validation

* fix task name split correctly

* add docstring for test
Automatically generated by python-semantic-release
* Adding Hinvec Model's Meta data.

* Adding hinvec_model.py

* Update mteb/models/hinvec_models.py

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* formated code with Black and lint with Ruff

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Samoed and others added 16 commits June 28, 2025 11:25
* nvidia_llama_nemoretriever_colembed

* correct 3b reference

* lint fix

* add training data and license for nvidia/llama_nemoretriever_colembed

* lint

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* add listconranker modelmeta

* fix bugs

* use linter

* lint

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
comment kalm model
* Add JaCWIR and JQaRA for reranking

* Fix ANLP Journal datasets

* Add NLPJournalAbsArticleRetrieval and JaCWIRRetrieval

* tackle test cases

* Remove _evaluate_subset usage

* Separate v1 and v2

* Update info for NLP Journal datasets
* add tooka v2s

* add mcinext models

* update mcinext.py

* Apply PR review suggestions

* Update mteb/models/mcinext_models.py

---------

Co-authored-by: mehran <mehan.sarmadi16@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Added DadoEvalCoarseClassification

* Removed unnecessary columns from DadoEvalCoarseClassification

* Added EmitClassification task

* added SardiStanceClassification task

* Added GeoLingItClassification task

* Added DisCoTexPairClassification tasks

* Added EmitClassification, DadoEvalCoarseClassification, GeoLingItClassification, SardiStanceClassification inside the inits

* changed import in DisCoTexPairClassification

* removed GeoLingItClassification dataset

* fixed citation formatting, missing metadata parameters and lint formatting

* - Added XGlueWRPReranking task
- Added missing __init__.py files

* fixed metadata in XGlueWRPReranking

* Added MKQARetrieval task

* fixed type in XGlueWRPReranking

* changed MKQARetrieval from  cross-lingual to monolingual

* formatted MKQARetrieval file

* removed unused const

---------

Co-authored-by: Mattia Sangermano <MattiaSangermano@users.noreply.huggingface.co>
Automatically generated by python-semantic-release
# Conflicts:
#	docs/create_tasks_table.py
#	docs/tasks.md
#	docs/usage/usage.md
#	mteb/evaluation/evaluators/RetrievalEvaluator.py
#	mteb/models/instruct_wrapper.py
#	mteb/models/model_implementations/jina_models.py
#	mteb/models/model_implementations/misc_models.py
#	mteb/models/model_implementations/openai_models.py
#	mteb/models/model_implementations/ru_sentence_models.py
#	mteb/models/overview.py
#	mteb/models/wrapper.py
#	mteb/tasks/Classification/__init__.py
#	mteb/tasks/Clustering/nob/snl_clustering.py
#	mteb/tasks/MultiLabelClassification/__init__.py
#	mteb/tasks/PairClassification/__init__.py
#	mteb/tasks/Reranking/__init__.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
#	pyproject.toml
#	tests/test_benchmark/mock_models.py
#	tests/test_benchmark/test_benchmark.py
@Samoed Samoed merged commit a23e2eb into v2.0.0 Jul 10, 2025
9 checks passed
@Samoed Samoed deleted the merge_main_v2_07_10 branch July 10, 2025 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.