@Samoed (Member) commented Mar 22, 2025

Closes #1735

  • Renamed Wrapper to AbsEncoder
  • Added similarity and pairwise similarity functions to the Encoder interface
  • Created a file for the similarity functions and moved PromptType and BatchedInput to the types file

All tasks now use the model's similarity implementation, which is cosine by default. This allows new approaches to similarity, but it does mean that scores will change (a rough sketch of the new interface follows the list below):

  • Scores change for bitext mining and retrieval tasks with models that do not use cosine similarity. For now, the only such model is facebook/contriever-msmarco, which was added in v2.
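
For reviewers, this is roughly what the new interface looks like from a model implementer's point of view. It is a simplified sketch with illustrative names (e.g. similarity_pairwise), not the exact code in this PR:

```python
import numpy as np


def cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


class MyEncoder:
    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        """Model-specific encoding; returns one embedding per input sentence."""
        raise NotImplementedError

    # Models that score with something other than cosine (e.g. dot product)
    # override these two methods; tasks now call them instead of hard-coding cosine.
    def similarity(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        return cos_sim(a, b)

    def similarity_pairwise(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        return np.diagonal(cos_sim(a, b))
```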

For now, I don't know what to do with CrossEncoders. Should we create another interface for them?

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform() (see the sketch after this checklist)
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
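
For reference, the subsampling item above usually looks roughly like this inside the task class (a sketch only; the import path, base class, and split names are assumptions and may differ for the actual task):

```python
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification


class MyNewTask(AbsTaskClassification):  # hypothetical task class
    def dataset_transform(self):
        # Downsample an oversized split while roughly preserving the label
        # distribution; the seed comes from the task so results are reproducible.
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["test"]
        )
```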

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded (example after this checklist) using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.
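
For example, a minimal load check can look like this (the model name is just an example; pass a specific revision if the ModelMeta pins one):

```python
import mteb

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

meta = mteb.get_model_meta(model_name)  # resolves the ModelMeta entry only
model = mteb.get_model(model_name)      # loads the actual implementation
print(meta.name, meta.revision)
```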

@Samoed marked this pull request as ready for review on March 23, 2025, 21:39.
@Samoed (Member, Author) commented Mar 25, 2025

@isaac-chung @KennethEnevoldsen Can you review?

@KennethEnevoldsen (Contributor) left a comment:

A lot of good stuff here - sorry the second part of the review came so long after the first comments (the GitHub UI is simply excessively slow when working with large diffs - this is also why some of the comments are a bit terse)

@Samoed (Member, Author) commented Apr 1, 2025

There are 18 model groups; some of them are difficult to run and some are too big for me to run. This PR will not change much for the models. I think we can check them later, or can you provide the top 5 models for us to check?

@KennethEnevoldsen (Contributor) left a comment:

Looks good. I think this one is ready to merge

@isaac-chung (Collaborator) commented:

> There are 18 model groups; some of them are difficult to run and some are too big for me to run. This PR will not change much for the models. I think we can check them later, or can you provide the top 5 models for us to check?

Hmm, if you can run every model in this screenshot that is runnable, it's probably a good enough check. Thanks again in advance!
[Screenshot 2025-04-02 at 14:47:43, taken from the MIEB multilingual LB, sorted by mean (task type)]

CC @gowitheflow-1998 this is a change in v2.

@Samoed (Member, Author) commented Apr 2, 2025

You're sharing MTEB(Multilingual, v1). For me, MIEB(Multilingual) looks like this:

[screenshot: MIEB(Multilingual) leaderboard]

@isaac-chung (Collaborator) commented:

> (taken from the MIEB multilingual LB, sorted by mean (task type))

Did you try this?

@Samoed (Member, Author) commented Apr 2, 2025

It's a bit different with that sort, but you're still sharing MTEB, because GritLM, gte, and e5-mistral are text-only models.

[screenshot: leaderboard view]

@isaac-chung (Collaborator) commented:

Ah yes, yours is correct. e5-v is at the top.

@Samoed (Member, Author) commented Apr 3, 2025

@orionw Hope you’ve successfully submitted to COLM! Could you take a look at the PR when you have a moment?

Regarding the EncoderInterface, maybe you have some suggestions for how we could better support cross-encoders?

@orionw (Contributor) commented Apr 4, 2025

Thanks for the ping @Samoed, sorry for the delay. It seems like this is the same interface that the cross-encoders currently have: namely the __init__, combine_query_and_instruction, and predict.

I have missed a lot of the conversation here, so apologies if this has been hashed out: the only thing I find confusing is that we lump cross-encoders into the same class as the bi-encoders (same with the base DenseRetrievalExactSearch class). I would prefer to have them be separate interfaces, as there really isn't any overlap between how they work (except for GritLM).
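
To make that concrete, a separate interface could be as small as something like this (just a sketch of the shape, not a proposal for exact names):

```python
from typing import Protocol


class CrossEncoderProtocol(Protocol):
    """Scores (query, passage) pairs directly instead of embedding each side."""

    def combine_query_and_instruction(self, query: str, instruction: str) -> str:
        """Merge the task instruction into the query before scoring."""
        ...

    def predict(self, sentences: list[tuple[str, str]], **kwargs) -> list[float]:
        """Return one relevance score per (query, passage) pair."""
        ...
```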

Otherwise it looks good and complete w.r.t. cross-encoders!

@Samoed (Member, Author) commented Apr 4, 2025

> the only thing I find confusing is that we lump cross-encoders into the same class as the bi-encoders

Yes, I understand your point, but we haven't implemented a cross-encoder interface anywhere yet, and it would be difficult to check each time whether a model is a cross-encoder. I have made a change based on ModelMeta.is_cross_encoder, so the predict function will only be used when the model is a cross-encoder (roughly as in the sketch below). For now, I think it would be easier to support a single interface.
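
Roughly like this (a simplified sketch; the attribute and method names here are illustrative, not the exact code in the PR):

```python
def score_pairs(model, model_meta, queries: list[str], passages: list[str]):
    """Score query/passage pairs with either a cross-encoder or a bi-encoder."""
    if getattr(model_meta, "is_cross_encoder", False):
        # Cross-encoders score the raw text pairs directly via predict.
        return model.predict(list(zip(queries, passages)))
    # Bi-encoders embed both sides and compare with the model's own similarity
    # (cosine by default under the new Encoder interface).
    query_emb = model.encode(queries)
    passage_emb = model.encode(passages)
    return model.similarity_pairwise(query_emb, passage_emb)
```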

@Samoed (Member, Author) commented Apr 4, 2025

@isaac-chung I've tested the models.

Results from this PR:

| task_name | google/siglip-base-patch16-256-multilingual | google/siglip-base-patch16-384 | google/siglip-base-patch16-512 | google/siglip-large-patch16-256 | google/siglip-large-patch16-384 | google/siglip-so400m-patch14-384 | laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k |
|---|---|---|---|---|---|---|---|---|
| AROFlickrOrder | 35.78 | 34.36 | 31.94 | 37.58 | 33.2 | 35 | 36.18 | 38.86 |
| BLINKIT2IRetrieval | 49.76 | 50.44 | 51.92 | 48.35 | 49.69 | 50.33 | 50.38 | 52.28 |
| CVBenchCount | 11.42 | 22.72 | 21.19 | 13.2 | 10.91 | 3.93 | 26.4 | 20.05 |
| MNIST | 94.61 | 95.24 | 95.38 | 96.05 | 96.13 | 95.95 | 96.03 | 95.74 |
| MNISTZeroShot | 83.1 | 80.88 | 82.86 | 85 | 85.17 | 88.56 | 81.46 | 77.1 |
| STS12VisualSTS | 66.62 | 64.62 | 64.97 | 63.94 | 66.3 | 61.9 | 62.37 | 62.81 |

Results from leaderboard:

| task_name | google/siglip-base-patch16-256-multilingual | google/siglip-base-patch16-384 | google/siglip-base-patch16-512 | google/siglip-large-patch16-256 | google/siglip-large-patch16-384 | google/siglip-so400m-patch14-384 | laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k |
|---|---|---|---|---|---|---|---|---|
| AROFlickrOrder | 36 | 34.18 | 32.14 | 37.76 | 33.34 | 34.92 | 34.56 | 38.84 |
| BLINKIT2IRetrieval | 37.31 | 35.82 | 38.31 | 35.57 | 33.58 | 36.82 | 36.32 | 39.55 |
| CVBenchCount | 34.64 | 53.43 | 55.2 | 8.88 | 8.76 | 21.7 | 43.27 | 4.19 |
| MNIST | 94.98 | 95.35 | 95.48 | 96.05 | 96.21 | 96.11 | 96.16 | 96.12 |
| MNISTZeroShot | 83.09 | 80.88 | 82.85 | 85 | 85.17 | 88.56 | 81.44 | 77.09 |
| STS12VisualSTS | 66.62 | 64.62 | 64.97 | 63.94 | 66.3 | 61.9 | 62.36 | 62.81 |

The BLINKIT2IRetrieval and CVBenchCount results don't match, so I've also run them from main (table below). I can't reproduce the leaderboard scores for BLINKIT2IRetrieval, and for CVBenchCount I don't understand why the results from my PR don't match.

| task_name | google/siglip-base-patch16-512 | google/siglip-large-patch16-256 | google/siglip-large-patch16-384 | google/siglip-so400m-patch14-384 |
|---|---|---|---|---|
| BLINKIT2IRetrieval | 51.88 | 48.47 | 49.86 | 50.32 |
| CVBenchCount | 55.2 | 8.88 | 8.76 | 21.7 |

@isaac-chung (Collaborator) commented Apr 4, 2025

Odd. I was able to get 'main_score': 0.55203 on main for google/siglip-base-patch16-512 and CVBenchCount. I can take a look at the PR's score.

@Samoed (Member, Author) commented Apr 4, 2025

Yes, I'm getting a similar score, 51.88, on BLINKIT2IRetrieval, but the leaderboard reports 38.31. I think the metric for this retrieval task was changed at some point, because the stored result has ndcg@10 = 0.51 but a main score of 38.31 (maybe it previously used ndcg@1, and that's why the main score is different): https://github.com/embeddings-benchmark/results/blob/7efced57a95d46c32df7f2bdb51bf0ea77944fbc/results/google__siglip-base-patch16-512/753a949581523b60257d93e18391e8c27f72eb22/BLINKIT2IRetrieval.json#L162

CC @gowitheflow-1998

@isaac-chung (Collaborator) commented:

I got 'main_score': 0.55457 from v2.0.0 for google/siglip-base-patch16-512 and CVBenchCount. So far main, v2.0.0, and the LB agree for this model-task combination.

@isaac-chung (Collaborator) commented:

We previously overwrote LB scores to use cv_recall_at_1 for BLINKIT2IRetrieval: embeddings-benchmark/results#122 (comment). We should update main_score to cv_recall_at_1 as well.

@isaac-chung (Collaborator) commented Apr 4, 2025

That should be it, thanks so much, @Samoed!
@gowitheflow-1998 I think we anticipated small deviations like 1-2 decimal points. So this looks decent.

@Samoed (Member, Author) commented Apr 4, 2025

After merging main, the CVBenchCount scores match too:

| task_name | google/siglip-base-patch16-512 | google/siglip-large-patch16-256 | google/siglip-large-patch16-384 | google/siglip-so400m-patch14-384 |
|---|---|---|---|---|
| CVBenchCount | 55.46 | 8.76 | 8.76 | 21.7 |

@Samoed merged commit 9460039 into v2.0.0 on Apr 4, 2025 (9 checks passed).
@Samoed deleted the new_encoder_interface branch on April 4, 2025, 19:42.