
Conversation

@namespace-Pt
Contributor

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted were obtained using the reference implementation (a minimal usage sketch follows this checklist)
  • My model is available, either as a publicly accessible API or publicly on e.g. Huggingface
  • I solemnly swear that for all results submitted I have not trained on the dataset, including the training set. If I have, I have disclosed it clearly.
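
A minimal sketch of what "obtained using the reference implementation" looks like in practice, assuming mteb is installed (`pip install mteb`); the two task names are illustrative only:

```python
import mteb

# Load the implementation registered in mteb/models/ (for API-based
# models this wraps the provider's API behind the same interface).
model = mteb.get_model("ByteDance-Seed/Seed1.5-Embedding")

# Any MTEB task names work here; these two are examples.
tasks = mteb.get_tasks(tasks=["ArguAna", "BIOSSES"])

# Run the evaluation; one JSON result file is written per task.
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```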

@KennethEnevoldsen created revision 4 in this PR. In order to show the correct results on the leaderboard, I've copied the results from revision 3 to revision 4.
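
A hedged sketch of what that copy amounts to, assuming the results repo's usual results/<org>__<model>/<revision>/ layout (the revision identifiers below are placeholders):

```python
import shutil
from pathlib import Path

# Copy every per-task JSON result file from the old revision folder
# to the new one so the leaderboard resolves the new revision.
model_dir = Path("results/ByteDance-Seed__Seed1.5-Embedding")
old_rev, new_rev = "revision-3-hash", "revision-4-hash"  # placeholders
shutil.copytree(model_dir / old_rev, model_dir / new_rev)
```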

@Samoed
Member

Samoed commented May 27, 2025

@KennethEnevoldsen Do these results look good?

@KennethEnevoldsen
Contributor

Formatting looks reasonable.

Here is the results table for MTEB(eng, v2):

| task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.92 | 0.93 | 0.78 | 0.79 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
| ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
| BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
| Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
| BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
| CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
| FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
| FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
| HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
| ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
| MTOPDomainClassification | 0.99 | 0.99 | 0.93 | 0.96 |
| MassiveIntentClassification | 0.87 | 0.88 | 0.68 | 0.78 |
| MassiveScenarioClassification | 0.93 | 0.92 | 0.71 | 0.81 |
| MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
| MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
| SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
| SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
| STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
| STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
| STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
| STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
| STS17 | 0.93 | 0.92 | 0.90 | 0.91 |
| STS22.v2 | 0.71 | 0.68 | 0.67 | 0.66 |
| STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
| SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
| TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
| Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
| ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
| TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
| TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
| Average | 0.75 | 0.73 | 0.63 | 0.70 |

and here is the full table for all models:

| task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
|---|---|---|---|---|
| AFQMC | 0.57 | nan | nan | nan |
| ATEC | 0.54 | nan | nan | nan |
| AmazonCounterfactualClassification | 0.92 | 0.88 | 0.68 | 0.78 |
| AmazonReviewsClassification | 0.58 | nan | 0.35 | 0.47 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
| ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
| BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
| BQ | 0.70 | nan | nan | nan |
| Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
| BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
| BrightRetrieval | 0.27 | nan | nan | nan |
| CLSClusteringP2P | 0.54 | nan | nan | nan |
| CLSClusteringS2S | 0.62 | nan | nan | nan |
| CMedQAv1-reranking | 0.82 | nan | nan | nan |
| CMedQAv2-reranking | 0.84 | nan | 0.23 | 0.76 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
| CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
| CmedqaRetrieval | 0.52 | nan | 0.03 | 0.31 |
| Cmnli | 0.91 | nan | nan | nan |
| CovidRetrieval | 0.88 | 0.79 | 0.20 | 0.59 |
| DuRetrieval | 0.94 | nan | nan | nan |
| EcomRetrieval | 0.73 | nan | nan | nan |
| FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
| FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
| HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
| IFlyTek | 0.56 | nan | nan | nan |
| ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
| JDReview | 0.89 | nan | nan | nan |
| LCQMC | 0.81 | nan | nan | nan |
| MMarcoReranking | 0.36 | nan | nan | nan |
| MMarcoRetrieval | 0.89 | nan | nan | nan |
| MTOPDomainClassification | 0.99 | 0.98 | 0.66 | 0.90 |
| MassiveIntentClassification | 0.86 | 0.82 | 0.33 | 0.58 |
| MassiveScenarioClassification | 0.92 | 0.87 | 0.40 | 0.63 |
| MedicalRetrieval | 0.71 | nan | nan | nan |
| MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
| MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
| MultilingualSentiment | 0.83 | nan | nan | nan |
| Ocnli | 0.84 | nan | nan | nan |
| OnlineShopping | 0.96 | nan | nan | nan |
| PAWSX | 0.68 | nan | nan | nan |
| QBQTC | 0.52 | nan | nan | nan |
| SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
| SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
| STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
| STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
| STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
| STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
| STS17 | 0.93 | 0.89 | 0.48 | 0.91 |
| STS22.v2 | 0.72 | 0.72 | 0.57 | 0.61 |
| STSB | 0.86 | 0.85 | 0.43 | 0.78 |
| STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
| SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
| T2Reranking | 0.67 | 0.68 | 0.60 | 0.67 |
| T2Retrieval | 0.90 | nan | nan | nan |
| TNews | 0.57 | nan | nan | nan |
| TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
| ThuNewsClusteringP2P | 0.83 | nan | nan | nan |
| ThuNewsClusteringS2S | 0.85 | nan | nan | nan |
| Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
| ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
| TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
| TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
| VideoRetrieval | 0.81 | nan | nan | nan |
| Waimai | 0.92 | nan | nan | nan |
| Average | 0.74 | 0.73 | 0.55 | 0.67 |
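
As a side note, a hedged sketch of how the Average rows above can be reproduced with pandas, assuming the table is exported to a CSV (the file name is hypothetical) and that missing (nan) scores are skipped rather than counted as zero:

```python
import pandas as pd

# Load the flattened results table; task_name becomes the row index.
df = pd.read_csv("results_table.csv", index_col="task_name")

# Column-wise mean per model; pandas skips NaN by default, so a model
# is averaged only over the tasks it was actually evaluated on.
print(df.mean(skipna=True).round(2))
```

Because NaNs are skipped, each model's average is taken over a different task set, so the averages are only roughly comparable across models.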

@namespace-Pt
Contributor Author

@KennethEnevoldsen If there are no other issues, could you please approve and merge this PR so that the results on the leaderboard are correct? Please let me know if there are any other problems :) Thanks in advance.

@KennethEnevoldsen
Contributor

Hi @namespace-Pt, sorry, I just wanted to look through the scores. A few look quite high: ClimateFEVER, FEVER, and Touche2020. Can I get a confirmation that these results are correct?

@namespace-Pt
Contributor Author

Hi @KennethEnevoldsen. Yes, the results are correct. We guarantee no contamination during our training process.

@KennethEnevoldsen enabled auto-merge (squash) May 27, 2025 17:09
@namespace-Pt
Contributor Author

BTW, the results of NV-Embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think due to the misuse of instructions. From my own testing, when using the correct instructions (as stated in their paper), the results of NV-Embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
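
For illustration, a sketch of the query-side instruction template described in the NV-Embed paper; the FEVER instruction text below is the commonly used wording and may not match what mteb currently applies (documents are encoded without any instruction prefix):

```python
def format_query(instruction: str, query: str) -> str:
    # NV-Embed-style prompt: the task instruction is prepended to the
    # query only; corpus documents are embedded as-is.
    return f"Instruct: {instruction}\nQuery: {query}"

fever_instruction = "Given a claim, retrieve documents that support or refute the claim."
print(format_query(fever_instruction, "The Earth's climate is warming."))
```

Passing a different (or empty) instruction at query time is exactly the kind of mismatch that can depress scores on instruction-sensitive retrieval tasks like FEVER and Touche.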

@KennethEnevoldsen
Contributor

Thanks, ahh, I didn't know we didn't match the instructions. But NV-Embed is also trained specifically on those datasets, so I would expect a bit of inflated performance.

@Samoed
Member

Samoed commented May 27, 2025

Added comment about instructions to embeddings-benchmark/mteb#1600

@KennethEnevoldsen merged commit 0f6fab6 into embeddings-benchmark:main May 27, 2025
2 checks passed