Conversation

@isaac-chung commented Feb 16, 2025

Fixes embeddings-benchmark/mteb#1823

Add MIEB results. The following models have been renamed to include the org name (based on local test failures):

Related MTEB issue: embeddings-benchmark/mteb#2074

Checklist

  • Run tests locally using make test to make sure nothing is broken.
  • Run the results files checker make pre-push.

Adding a model checklist

  • I have added the model implementation to the mteb/models/ directory. Instructions for adding a model can be found in the following PR ____

@isaac-chung

When pointing embeddings-benchmark/mteb#2035 to this branch, it seems that MIEB results cannot be displayed because of the "Number of parameters" field.
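
For context, a hypothetical sketch of the kind of failure this points to, assuming the leaderboard builds a parameter-count column from each model's n_parameters metadata and some MIEB models have it set to None (the helper name below is made up, not mteb's actual code):

def format_n_parameters(n_parameters: int | None) -> str:
    # Hypothetical column formatter: without the None check, missing
    # parameter counts would raise and the whole table fails to render.
    if n_parameters is None:
        return "Unknown"  # render a placeholder instead of raising
    return f"{n_parameters / 1e6:.0f}M"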

@isaac-chung commented Feb 16, 2025

@gowitheflow-1998 @KennethEnevoldsen here's a screenshot of the leaderboard, hacked to point to this branch. The eng and lite versions were able to render as well. The cache needed to be wiped.

[Screenshot, 2025-02-16: leaderboard rendering MIEB results from this branch]

@gowitheflow-1998

There are a few tasks where the main metric was wrong when we implemented them and doesn't match the paper. Let me double-check all tasks and get back. It might be a good idea to replace the main-metric scores with the actual main metrics before we merge.

@isaac-chung marked this pull request as draft on February 17, 2025, 03:25
@KennethEnevoldsen

It also seems like the performance vs. model size plot needs some model references. You can add these in mteb.leaderboard.figures.models_to_annotate, which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

@isaac-chung commented Feb 18, 2025

It also seems like the performance vs. model size plot needs some model references. You can add these in mteb.leaderboard.figures.models_to_annotate, which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

What does "some model references" mean? How do we select the models for this list?

Figured it out 👍

[update] Added a few models that ranked first in a few task types (see the sketch of the resulting list below):

  • "EVA02-CLIP-bigE-14-plus"
  • "voyage-multimodal-3"
  • "e5-v"
  • "VLM2Vec-Full"

@isaac-chung

The performance per task type plot isn't showing, though 🤔 It says the benchmark only contains one task type when there are 8.

@KennethEnevoldsen

Hmm, not sure why this is happening. @x-tabdeveloping do you have an idea?

@x-tabdeveloping

I'll have a look at it tomorrow

@x-tabdeveloping

@isaac-chung My guess would be that it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list is that instruction retrieval scores are negative, which doesn't really work with the radar chart. We could either extend this list, or keep only a list of exceptions and infer the task types from somewhere else.
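
For what it's worth, a minimal sketch of the second option, an exclusion list plus inferring the task types from the benchmark itself (assuming mteb.get_benchmark and task.metadata.type behave as in recent mteb versions; this is not the actual leaderboard code):

import mteb

# Task types hidden from the radar chart (negative scores render poorly).
EXCLUDED_TASK_TYPES = {"InstructionRetrieval"}

def radar_task_types(benchmark_name: str) -> list[str]:
    # Infer the task types present in the benchmark instead of hardcoding them,
    # so new task types (e.g. the MIEB image tasks) show up automatically.
    benchmark = mteb.get_benchmark(benchmark_name)
    types = {task.metadata.type for task in benchmark.tasks}
    return sorted(types - EXCLUDED_TASK_TYPES)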

@isaac-chung

@isaac-chung My guess would be that it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list is that instruction retrieval scores are negative, which doesn't really work with the radar chart. We could either extend this list, or keep only a list of exceptions and infer the task types from somewhere else.

That's it. Thanks! It's working now.

@gowitheflow-1998

Have fixed the main metric issue by overwriting the main scores with the actual main metric scores. Also deleted previous incomplete Jina runs from an old version that only had a few task results.

The overwritten scores include:

task_metric_mapping = {
    "BLINKIT2IRetrieval.json": "cv_recall_at_1",
    "BLINKIT2TRetrieval.json": "cv_recall_at_1",
    "ImageCoDeT2IRetrieval.json": "cv_recall_at_3",
    "ROxfordEasyI2IMultiChoice.json": "map_at_5",
    "ROxfordMediumI2IMultiChoice.json": "map_at_5",
    "ROxfordHardI2IMultiChoice.json": "map_at_5",
    "RParisEasyI2IMultiChoice.json": "map_at_5",
    "RParisMediumI2IMultiChoice.json": "map_at_5",
    "RParisHardI2IMultiChoice.json": "map_at_5",
    "TinyImageNetClustering.json": "nmi",
    "CIFAR10Clustering.json": "nmi",
    "CIFAR100Clustering.json": "nmi",
    "ImageNet10Clustering.json": "nmi",
    "ImageNetDog15Clustering.json": "nmi",
}
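
Roughly, the overwrite amounts to something like the following, using the task_metric_mapping above (a hypothetical sketch, not the script actually used; it assumes the results-repo JSON layout where scores are keyed by split and listed per subset):

import json
from pathlib import Path

def overwrite_main_scores(results_dir: Path) -> None:
    # Walk the results files and copy the task's actual main metric
    # into the "main_score" field for the affected tasks.
    for path in results_dir.rglob("*.json"):
        metric = task_metric_mapping.get(path.name)
        if metric is None:
            continue  # task not affected
        data = json.loads(path.read_text())
        for split_scores in data["scores"].values():  # e.g. "test"
            for subset_scores in split_scores:        # one entry per hf_subset
                if metric in subset_scores:
                    subset_scores["main_score"] = subset_scores[metric]
        path.write_text(json.dumps(data, indent=2))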

@isaac-chung

@gowitheflow-1998 good stuff! Are we ready to merge?

@gowitheflow-1998 marked this pull request as ready for review on February 23, 2025, 13:59
@gowitheflow-1998 merged commit de7d977 into main on February 23, 2025 (2 checks passed)
@gowitheflow-1998

@gowitheflow-1998 good stuff! Are we ready to merge?

Yeah, merged! Adding @Muennighoff as a co-author for running most of the results here!

@Muennighoff

Does this have everything from https://github.com/embeddings-benchmark/tmp, i.e. can we safely delete that repo?

@gowitheflow-1998

Does this have everything from https://github.com/embeddings-benchmark/tmp, i.e. can we safely delete that repo?

Yeah! All results are here.
