Update Seed1.5-Embedding revision 4 #205
Conversation
@KennethEnevoldsen Do these results look good?
formatting looks reasonable
Here is the results table of MTEB(eng, v2):
task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
---|---|---|---|---|
AmazonCounterfactualClassification | 0.92 | 0.93 | 0.78 | 0.79 |
ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
MTOPDomainClassification | 0.99 | 0.99 | 0.93 | 0.96 |
MassiveIntentClassification | 0.87 | 0.88 | 0.68 | 0.78 |
MassiveScenarioClassification | 0.93 | 0.92 | 0.71 | 0.81 |
MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
STS17 | 0.93 | 0.92 | 0.90 | 0.91 |
STS22.v2 | 0.71 | 0.68 | 0.67 | 0.66 |
STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
Average | 0.75 | 0.73 | 0.63 | 0.70 |
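For context, a minimal sketch of how a table like this can be produced with the mteb library (the benchmark name and the `get_model` call assume a recent mteb version; this is not the exact invocation used for the runs above):

```python
# Minimal sketch, assuming a recent mteb release: run the MTEB(eng, v2)
# benchmark on a registered model and write one JSON result file per task.
import mteb

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
model = mteb.get_model("ByteDance-Seed/Seed1.5-Embedding")  # assumes the model is registered in mteb/models/

evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(model, output_folder="results")
```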
and here is the full table for all models:
task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
---|---|---|---|---|
AFQMC | 0.57 | nan | nan | nan |
ATEC | 0.54 | nan | nan | nan |
AmazonCounterfactualClassification | 0.92 | 0.88 | 0.68 | 0.78 |
AmazonReviewsClassification | 0.58 | nan | 0.35 | 0.47 |
ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
BQ | 0.70 | nan | nan | nan |
Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
BrightRetrieval | 0.27 | nan | nan | nan |
CLSClusteringP2P | 0.54 | nan | nan | nan |
CLSClusteringS2S | 0.62 | nan | nan | nan |
CMedQAv1-reranking | 0.82 | nan | nan | nan |
CMedQAv2-reranking | 0.84 | nan | 0.23 | 0.76 |
CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
CmedqaRetrieval | 0.52 | nan | 0.03 | 0.31 |
Cmnli | 0.91 | nan | nan | nan |
CovidRetrieval | 0.88 | 0.79 | 0.20 | 0.59 |
DuRetrieval | 0.94 | nan | nan | nan |
EcomRetrieval | 0.73 | nan | nan | nan |
FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
IFlyTek | 0.56 | nan | nan | nan |
ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
JDReview | 0.89 | nan | nan | nan |
LCQMC | 0.81 | nan | nan | nan |
MMarcoReranking | 0.36 | nan | nan | nan |
MMarcoRetrieval | 0.89 | nan | nan | nan |
MTOPDomainClassification | 0.99 | 0.98 | 0.66 | 0.90 |
MassiveIntentClassification | 0.86 | 0.82 | 0.33 | 0.58 |
MassiveScenarioClassification | 0.92 | 0.87 | 0.40 | 0.63 |
MedicalRetrieval | 0.71 | nan | nan | nan |
MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
MultilingualSentiment | 0.83 | nan | nan | nan |
Ocnli | 0.84 | nan | nan | nan |
OnlineShopping | 0.96 | nan | nan | nan |
PAWSX | 0.68 | nan | nan | nan |
QBQTC | 0.52 | nan | nan | nan |
SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
STS17 | 0.93 | 0.89 | 0.48 | 0.91 |
STS22.v2 | 0.72 | 0.72 | 0.57 | 0.61 |
STSB | 0.86 | 0.85 | 0.43 | 0.78 |
STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
T2Reranking | 0.67 | 0.68 | 0.60 | 0.67 |
T2Retrieval | 0.90 | nan | nan | nan |
TNews | 0.57 | nan | nan | nan |
TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
ThuNewsClusteringP2P | 0.83 | nan | nan | nan |
ThuNewsClusteringS2S | 0.85 | nan | nan | nan |
Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
VideoRetrieval | 0.81 | nan | nan | nan |
Waimai | 0.92 | nan | nan | nan |
Average | 0.74 | 0.73 | 0.55 | 0.67 |
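For anyone reproducing the comparison, here is a rough sketch of how per-task scores can be collected from the result files in this repo into a table like the one above. The directory layout and the `main_score` location reflect this repo's JSON format as I understand it, so treat both as assumptions:

```python
# Rough sketch: gather main_score per task for a set of models from the
# results/<org>__<model>/<revision>/<Task>.json files in this repository.
import json
from pathlib import Path

models = [
    "ByteDance-Seed__Seed1.5-Embedding",
    "google__gemini-embedding-001",
    "intfloat__e5-large-v2",
    "nvidia__NV-Embed-v2",
]

table: dict[str, dict[str, float]] = {}
for model in models:
    for result_file in Path("results", model).rglob("*.json"):
        data = json.loads(result_file.read_text())
        if "scores" not in data:  # skip model_meta.json and similar files
            continue
        split_scores = data["scores"].get("test") or next(iter(data["scores"].values()))
        # For multi-subset tasks this takes the first subset only -- a simplification.
        table.setdefault(data["task_name"], {})[model] = split_scores[0]["main_score"]

for task in sorted(table):
    row = " | ".join(f"{table[task].get(m, float('nan')):.2f}" for m in models)
    print(f"{task} | {row}")
```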
@KennethEnevoldsen If there are no other issues, could you please approve and merge this PR so that the results on the leaderboard are correct? Please let me know if there are any other problems :) Thanks in advance.
Hi @namespace-Pt, sorry, I just wanted to look through the scores first. A few look quite high: ClimateFEVER, FEVER, and Touche2020. Can I get a confirmation that these results are correct?
Hi @KennethEnevoldsen. Yes, the results are correct. We guarantee there was no contamination during our training process.
BTW, the results of NV-Embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think due to the misuse of instructions. From my own testing, with the correct instructions (as stated in their paper), the results of NV-Embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
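For reference, a minimal sketch of what I mean by passing the correct instruction at query time. Loading follows the sentence-transformers path from the NV-Embed model card; the FEVER-style instruction text below is illustrative, not a quote of their exact prompt:

```python
# Illustrative sketch: NV-Embed-style models expect a task instruction prefixed
# to queries, while documents are encoded without one. The instruction text is
# a FEVER-style example, not necessarily the exact prompt from the paper.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side = "right"

task_instruction = "Given a claim, retrieve documents that support or refute the claim"
query_prompt = f"Instruct: {task_instruction}\nQuery: "

queries = ["The Eiffel Tower is located in Berlin."]
documents = ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."]

query_emb = model.encode(queries, prompt=query_prompt, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)
print(query_emb @ doc_emb.T)  # cosine similarity, since embeddings are normalized
```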
Thanks - ahh, I didn't know we didn't match the instructions. But NV-Embed is also trained specifically on those datasets, so I would expect somewhat inflated performance.
Added comment about instructions to embeddings-benchmark/mteb#1600 |
Checklist

- Model implementation added to mteb/models/ (this can be as an API). Instructions on how to add a model can be found here.

@KennethEnevoldsen I created a revision `4` in this PR. In order to show the correct results on the leaderboard, I've copied the results from revision `3` to revision `4`.
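The copy itself is mechanical; a hypothetical sketch, assuming this repo's results/<org>__<model>/<revision>/ layout and the revision ids `3` and `4` discussed above:

```python
# Hypothetical sketch: duplicate per-task result files from revision "3" into
# the new revision "4" folder so the leaderboard picks up the existing scores.
# Paths assume the results/<org>__<model>/<revision>/ layout of this repo.
import shutil
from pathlib import Path

model_dir = Path("results/ByteDance-Seed__Seed1.5-Embedding")
src, dst = model_dir / "3", model_dir / "4"
dst.mkdir(parents=True, exist_ok=True)

for result_file in src.glob("*.json"):
    shutil.copy2(result_file, dst / result_file.name)
```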