Conversation

@isaac-chung commented Feb 16, 2025

Fixes embeddings-benchmark/mteb#1823

Add MIEB results. The following models have been renamed to include the org name (based on local test failures):

Related MTEB issue: embeddings-benchmark/mteb#2074

Checklist

  • Run tests locally using make test to make sure nothing is broken.
  • Run the results files checker make pre-push.

Adding a model checklist

  • I have added the model implementation to the mteb/models/ directory. Instructions for adding a model can be found in the following PR ____

@isaac-chung

When pointing embeddings-benchmark/mteb#2035 to this branch, it seems that MIEB results cannot be displayed because of the "Number of parameters" field.
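
For context, a hypothetical sketch of the kind of failure this points to, assuming the leaderboard builds a parameter-count column from each model's n_parameters metadata and some MIEB models have it set to None (the helper name below is made up, not mteb's actual code):

def format_n_parameters(n_parameters: int | None) -> str:
    # Hypothetical column formatter: without the None check, missing
    # parameter counts would raise and the whole table fails to render.
    if n_parameters is None:
        return "Unknown"  # render a placeholder instead of raising
    return f"{n_parameters / 1e6:.0f}M"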

@isaac-chung commented Feb 16, 2025

@gowitheflow-1998 @KennethEnevoldsen here's a screenshot of the leaderboard, hacked to point to this branch. The eng and lite versions were able to render as well. The cache needed to be wiped.

[Screenshot, 2025-02-16: leaderboard rendering MIEB results from this branch]

@gowitheflow-1998

There are a few tasks where the main metric was wrong when we implemented them and doesn't match the paper. Let me double-check all tasks and get back. It might be a good idea to replace the main-metric scores with the actual main metrics before we merge.

@isaac-chung marked this pull request as draft on February 17, 2025, 03:25
@KennethEnevoldsen

It also seems like the performance vs. model size plot needs some model references. You can add these in mteb.leaderboard.figures.models_to_annotate, which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

@isaac-chung commented Feb 18, 2025

It also seems like the performance vs. model size plot needs some model references. You can add these in mteb.leaderboard.figures.models_to_annotate, which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

What does "some model references" mean? How do we select the models for this list?

Figured it out 👍

[update] Added a few models that ranked first in a few task types (see the sketch of the resulting list below):

  • "EVA02-CLIP-bigE-14-plus"
  • "voyage-multimodal-3"
  • "e5-v"
  • "VLM2Vec-Full"

@isaac-chung

The performance per task type plot isn't showing, though 🤔 It says the benchmark only contains one task type when there are 8.

@KennethEnevoldsen

Hmm, not sure why this is happening. @x-tabdeveloping do you have an idea?

@x-tabdeveloping

I'll have a look at it tomorrow

@x-tabdeveloping

@isaac-chung My guess would be that it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list is that instruction retrieval scores are negative, which doesn't really work with the radar chart. We could either extend this list, or keep only a list of exceptions and infer the task types from somewhere else.
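
For what it's worth, a minimal sketch of the second option, an exclusion list plus inferring the task types from the benchmark itself (assuming mteb.get_benchmark and task.metadata.type behave as in recent mteb versions; this is not the actual leaderboard code):

import mteb

# Task types hidden from the radar chart (negative scores render poorly).
EXCLUDED_TASK_TYPES = {"InstructionRetrieval"}

def radar_task_types(benchmark_name: str) -> list[str]:
    # Infer the task types present in the benchmark instead of hardcoding them,
    # so new task types (e.g. the MIEB image tasks) show up automatically.
    benchmark = mteb.get_benchmark(benchmark_name)
    types = {task.metadata.type for task in benchmark.tasks}
    return sorted(types - EXCLUDED_TASK_TYPES)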

@isaac-chung

@isaac-chung My guess would be that it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list is that instruction retrieval scores are negative, which doesn't really work with the radar chart. We could either extend this list, or keep only a list of exceptions and infer the task types from somewhere else.

That's it. Thanks! It's working now.

@gowitheflow-1998

Have fixed the main metric issue by overwriting the main scores with the actual main metric scores. Also deleted previous incomplete Jina runs from an old version that only had a few task results.

The overwritten scores include:

task_metric_mapping = {
    "BLINKIT2IRetrieval.json": "cv_recall_at_1",
    "BLINKIT2TRetrieval.json": "cv_recall_at_1",
    "ImageCoDeT2IRetrieval.json": "cv_recall_at_3",
    "ROxfordEasyI2IMultiChoice.json": "map_at_5",
    "ROxfordMediumI2IMultiChoice.json": "map_at_5",
    "ROxfordHardI2IMultiChoice.json": "map_at_5",
    "RParisEasyI2IMultiChoice.json": "map_at_5",
    "RParisMediumI2IMultiChoice.json": "map_at_5",
    "RParisHardI2IMultiChoice.json": "map_at_5",
    "TinyImageNetClustering.json": "nmi",
    "CIFAR10Clustering.json": "nmi",
    "CIFAR100Clustering.json": "nmi",
    "ImageNet10Clustering.json": "nmi",
    "ImageNetDog15Clustering.json": "nmi",
}
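
Roughly, the overwrite amounts to something like the following, using the task_metric_mapping above (a hypothetical sketch, not the script actually used; it assumes the results-repo JSON layout where scores are keyed by split and listed per subset):

import json
from pathlib import Path

def overwrite_main_scores(results_dir: Path) -> None:
    # Walk the results files and copy the task's actual main metric
    # into the "main_score" field for the affected tasks.
    for path in results_dir.rglob("*.json"):
        metric = task_metric_mapping.get(path.name)
        if metric is None:
            continue  # task not affected
        data = json.loads(path.read_text())
        for split_scores in data["scores"].values():  # e.g. "test"
            for subset_scores in split_scores:        # one entry per hf_subset
                if metric in subset_scores:
                    subset_scores["main_score"] = subset_scores[metric]
        path.write_text(json.dumps(data, indent=2))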

@isaac-chung

@gowitheflow-1998 good stuff! Are we ready to merge?

@gowitheflow-1998 marked this pull request as ready for review on February 23, 2025, 13:59
@gowitheflow-1998 merged commit de7d977 into main on February 23, 2025 (2 checks passed)
@gowitheflow-1998

@gowitheflow-1998 good stuff! Are we ready to merge?

Yeah, merged! Adding @Muennighoff as a co-author for running most of the results here!

@Muennighoff

Does this have everything from https://github.com/embeddings-benchmark/tmp, i.e. can we safely delete that repo?

@gowitheflow-1998

Does this have everything from https://github.com/embeddings-benchmark/tmp, i.e. can we safely delete that repo?

Yeah! All results are here.
