Qzhou embedding results #250
Conversation
Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001

Results for Kingsoft-LLM/QZhou-Embedding:
task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
---|---|---|---|---|
AFQMC | 0.67 | nan | 0.33 | 0.72 |
ATEC | 0.55 | nan | 0.4 | 0.65 |
AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
AskUbuntuDupQuestions | 0.69 | 0.64 | 0.59 | 0.70 |
BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
BQ | 0.77 | nan | 0.48 | 0.81 |
Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.56 |
CLSClusteringP2P | 0.65 | nan | nan | 0.82 |
CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
CMedQAv2-reranking | 0.94 | nan | 0.67 | 0.94 |
CQADupstackGamingRetrieval | 0.76 | 0.71 | 0.59 | 0.79 |
CQADupstackUnixRetrieval | 0.71 | 0.54 | 0.4 | 0.72 |
ClimateFEVERHardNegatives | 0.49 | 0.31 | 0.26 | 0.49 |
CmedqaRetrieval | 0.52 | nan | 0.29 | 0.57 |
Cmnli | 0.95 | nan | nan | 0.95 |
CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
HotpotQAHardNegatives | 0.81 | 0.87 | 0.71 | 0.87 |
IFlyTek | 0.58 | nan | 0.42 | 0.58 |
ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
JDReview | 0.88 | nan | 0.81 | 0.92 |
LCQMC | 0.82 | nan | 0.76 | 0.82 |
MMarcoReranking | 0.44 | nan | 0.29 | 0.47 |
MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
MassiveScenarioClassification | 0.74 | 0.87 | 0.7 | 0.99 |
MedicalRetrieval | 0.73 | nan | 0.51 | 0.76 |
MedrxivClusteringP2P.v2 | 0.50 | 0.47 | 0.34 | 0.52 |
MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
MindSmallReranking | 0.34 | 0.33 | 0.3 | 0.34 |
MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
Ocnli | 0.95 | nan | nan | 0.95 |
OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
PAWSX | 0.70 | nan | 0.15 | 0.70 |
QBQTC | 0.60 | nan | nan | 0.71 |
SCIDOCS | 0.29 | 0.25 | 0.17 | 0.35 |
SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
STS13 | 0.96 | 0.90 | 0.82 | 0.98 |
STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
STS15 | 0.95 | 0.90 | 0.89 | 0.98 |
STS17 | 0.89 | 0.89 | 0.82 | 0.93 |
STS22.v2 | 0.77 | 0.72 | 0.64 | 0.77 |
STSB | 0.92 | 0.85 | 0.82 | 0.92 |
STSBenchmark | 0.95 | 0.89 | 0.87 | 0.95 |
SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
SummEvalSummarization.v2 | 0.33 | 0.38 | 0.31 | 0.39 |
T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
T2Retrieval | 0.82 | nan | 0.76 | 0.89 |
TNews | 0.61 | nan | 0.49 | 0.61 |
TRECCOVID | 0.78 | 0.86 | 0.71 | 0.95 |
ThuNewsClusteringP2P | 0.82 | nan | nan | 0.89 |
ThuNewsClusteringS2S | 0.76 | nan | nan | 0.88 |
Touche2020Retrieval.v3 | 0.50 | 0.52 | 0.5 | 0.75 |
ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
TwentyNewsgroupsClustering.v2 | 0.81 | 0.57 | 0.39 | 0.88 |
TwitterSemEval2015 | 0.87 | 0.79 | 0.75 | 0.89 |
TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
VideoRetrieval | 0.79 | nan | 0.58 | 0.84 |
Waimai | 0.92 | nan | 0.86 | 0.92 |
Average | 0.76 | 0.73 | 0.61 | 0.81 |
Correct model_meta info. Past PR: #249
completed
We are still waiting for the model PR to merge :)
Thanks! I have looked over the scores, and a few seem suspiciously high. However, it seems like these are not in the annotated training data:

```python
import mteb

meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")

# in training data
"AmazonCounterfactualClassification" in meta.training_datasets  # True

# not in:
"AskUbuntuDupQuestions" in meta.training_datasets  # False
"BQ" in meta.training_datasets  # False
"Waimai" in meta.training_datasets  # False
"TNews" in meta.training_datasets  # False
"IFlyTek" in meta.training_datasets  # False
```

@PennyYu123 can you help me figure out these scores? Could you have missed some annotations or synthetically generated matching training data?
Yes, it would be great if you'd add them to the training datasets.
You can add your new scores in a new subfolder with your new revision.
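As a rough sketch of what that typically looks like in the results repository (the folder and file names below are illustrative assumptions; slashes in the model name are commonly replaced, and each revision gets its own subfolder of per-task JSON files):

```text
results/
└── Kingsoft-LLM__QZhou-Embedding/
    ├── <old-revision-hash>/
    │   ├── AFQMC.json
    │   └── ...
    └── <new-revision-hash>/
        ├── AFQMC.json
        └── ...
```

The evaluation scripts can then pick the scores matching the revision recorded in the model metadata.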
Hello, our new model results have been uploaded. We have also submitted a PR to the mteb repo, and we have replaced the original model parameter file with our new one on Hugging Face. Let's continue the previous process. 😊😊😊
Hi @PennyYu123, I have merged the PR, but it seems like there are still some datasets missing from the list that you provided:

```python
import mteb

meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")

"AmazonCounterfactualClassification" in meta.training_datasets  # True
"AskUbuntuDupQuestions" in meta.training_datasets  # False
"BQ" in meta.training_datasets  # False
"Waimai" in meta.training_datasets  # True (fixed)
"TNews" in meta.training_datasets  # False (fixed)
"IFlyTek" in meta.training_datasets  # False

# do also check the remainder of the list
```

Can I ask you to update the training datasets again?
Ahh, great, I will rerun the table to see if there are any remaining concerns.
I am back from holiday, so that should be possible. Sorry that you had to wait due to the holiday; normally, it takes no more than 1-2 days.
We, of course, always appreciate collaboration and contributions, but let us keep that out of the review process :)
Ahh! forgot to press submit on the review...
I have added the updated table below. There are still a few that seem concerning:
- TwitterSemEval2015
- SCIDOCS
- AskUbuntuDupQuestions
Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Kingsoft-LLM/QZhou-Embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for Kingsoft-LLM/QZhou-Embedding:
task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
---|---|---|---|---|
AFQMC | 0.66 | nan | 0.33 | 0.72 |
ATEC | 0.55 | nan | 0.4 | 0.65 |
AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
AskUbuntuDupQuestions | 0.75 | 0.64 | 0.59 | 0.75 |
BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
BQ | 0.77 | nan | 0.48 | 0.81 |
Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.37 | 0.56 |
CLSClusteringP2P | 0.67 | nan | nan | 0.82 |
CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
CMedQAv2-reranking | 0.93 | nan | 0.67 | 0.93 |
CQADupstackGamingRetrieval | 0.77 | 0.71 | 0.59 | 0.79 |
CQADupstackUnixRetrieval | 0.70 | 0.54 | 0.4 | 0.72 |
ClimateFEVERHardNegatives | 0.62 | 0.31 | 0.26 | 0.62 |
CmedqaRetrieval | 0.51 | nan | 0.29 | 0.57 |
Cmnli | 0.95 | nan | nan | 0.95 |
CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
HotpotQAHardNegatives | 0.80 | 0.87 | 0.71 | 0.87 |
IFlyTek | 0.57 | nan | 0.42 | 0.58 |
ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
JDReview | 0.90 | nan | 0.81 | 0.92 |
LCQMC | 0.82 | nan | 0.76 | 0.82 |
MMarcoReranking | 0.51 | nan | 0.29 | 0.51 |
MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
MassiveScenarioClassification | 0.73 | 0.87 | 0.7 | 0.99 |
MedicalRetrieval | 0.72 | nan | 0.51 | 0.76 |
MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.34 | 0.52 |
MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
MindSmallReranking | 0.36 | 0.33 | 0.3 | 0.36 |
MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
Ocnli | 0.95 | nan | nan | 0.95 |
OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
PAWSX | 0.70 | nan | 0.15 | 0.70 |
QBQTC | 0.61 | nan | nan | 0.71 |
SCIDOCS | 0.44 | 0.25 | 0.17 | 0.44 |
SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
STS13 | 0.95 | 0.90 | 0.82 | 0.98 |
STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
STS15 | 0.96 | 0.90 | 0.89 | 0.98 |
STS17 | 0.90 | 0.89 | 0.82 | 0.93 |
STS22.v2 | 0.78 | 0.72 | 0.64 | 0.78 |
STSB | 0.92 | 0.85 | 0.82 | 0.92 |
STSBenchmark | 0.96 | 0.89 | 0.87 | 0.96 |
SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
SummEvalSummarization.v2 | 0.34 | 0.38 | 0.31 | 0.39 |
T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
T2Retrieval | 0.81 | nan | 0.76 | 0.89 |
TNews | 0.60 | nan | 0.49 | 0.60 |
TRECCOVID | 0.79 | 0.86 | 0.71 | 0.95 |
ThuNewsClusteringP2P | 0.83 | nan | nan | 0.89 |
ThuNewsClusteringS2S | 0.78 | nan | nan | 0.88 |
Touche2020Retrieval.v3 | 0.49 | 0.52 | 0.5 | 0.75 |
ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
TwentyNewsgroupsClustering.v2 | 0.82 | 0.57 | 0.39 | 0.88 |
TwitterSemEval2015 | 0.92 | 0.79 | 0.75 | 0.92 |
TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
VideoRetrieval | 0.80 | nan | 0.58 | 0.84 |
Waimai | 0.92 | nan | 0.86 | 0.92 |
Average | 0.76 | 0.73 | 0.61 | 0.81 |
@PennyYu123 can you help me understand the few concerning datasets? Might there be missing dataset annotations?
We have concurrently updated the following components:
PR that updates the model revision: embeddings-benchmark/mteb#3069. I will recompute the table.
Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001

Results for Kingsoft-LLM/QZhou-Embedding:
task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
---|---|---|---|---|
AFQMC | 0.67 | nan | 0.33 | 0.72 |
ATEC | 0.55 | nan | 0.4 | 0.65 |
AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
AskUbuntuDupQuestions | 0.69 | 0.64 | 0.59 | 0.70 |
BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
BQ | 0.77 | nan | 0.48 | 0.81 |
Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.56 |
CLSClusteringP2P | 0.65 | nan | nan | 0.82 |
CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
CMedQAv2-reranking | 0.94 | nan | 0.67 | 0.94 |
CQADupstackGamingRetrieval | 0.76 | 0.71 | 0.59 | 0.79 |
CQADupstackUnixRetrieval | 0.71 | 0.54 | 0.4 | 0.72 |
ClimateFEVERHardNegatives | 0.49 | 0.31 | 0.26 | 0.49 |
CmedqaRetrieval | 0.52 | nan | 0.29 | 0.57 |
Cmnli | 0.95 | nan | nan | 0.95 |
CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
HotpotQAHardNegatives | 0.81 | 0.87 | 0.71 | 0.87 |
IFlyTek | 0.58 | nan | 0.42 | 0.58 |
ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
JDReview | 0.88 | nan | 0.81 | 0.92 |
LCQMC | 0.82 | nan | 0.76 | 0.82 |
MMarcoReranking | 0.44 | nan | 0.29 | 0.47 |
MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
MassiveScenarioClassification | 0.74 | 0.87 | 0.7 | 0.99 |
MedicalRetrieval | 0.73 | nan | 0.51 | 0.76 |
MedrxivClusteringP2P.v2 | 0.50 | 0.47 | 0.34 | 0.52 |
MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
MindSmallReranking | 0.34 | 0.33 | 0.3 | 0.34 |
MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
Ocnli | 0.95 | nan | nan | 0.95 |
OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
PAWSX | 0.70 | nan | 0.15 | 0.70 |
QBQTC | 0.60 | nan | nan | 0.71 |
SCIDOCS | 0.29 | 0.25 | 0.17 | 0.35 |
SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
STS13 | 0.96 | 0.90 | 0.82 | 0.98 |
STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
STS15 | 0.95 | 0.90 | 0.89 | 0.98 |
STS17 | 0.89 | 0.89 | 0.82 | 0.93 |
STS22.v2 | 0.77 | 0.72 | 0.64 | 0.77 |
STSB | 0.92 | 0.85 | 0.82 | 0.92 |
STSBenchmark | 0.95 | 0.89 | 0.87 | 0.95 |
SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
SummEvalSummarization.v2 | 0.33 | 0.38 | 0.31 | 0.39 |
T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
T2Retrieval | 0.82 | nan | 0.76 | 0.89 |
TNews | 0.61 | nan | 0.49 | 0.61 |
TRECCOVID | 0.78 | 0.86 | 0.71 | 0.95 |
ThuNewsClusteringP2P | 0.82 | nan | nan | 0.89 |
ThuNewsClusteringS2S | 0.76 | nan | nan | 0.88 |
Touche2020Retrieval.v3 | 0.50 | 0.52 | 0.5 | 0.75 |
ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
TwentyNewsgroupsClustering.v2 | 0.81 | 0.57 | 0.39 | 0.88 |
TwitterSemEval2015 | 0.87 | 0.79 | 0.75 | 0.89 |
TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
VideoRetrieval | 0.79 | nan | 0.58 | 0.84 |
Waimai | 0.92 | nan | 0.86 | 0.92 |
Average | 0.76 | 0.73 | 0.61 | 0.81 |
Alright, I think we finally got there! Congratulations again on the release :)
We have released the HF model publicly and resubmitted the mteb implementation.

Checklist
- Model implementation added under `mteb/models/`; this can be as an API. Instructions on how to add a model can be found here.
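For reference, a minimal sketch of how the `training_datasets` annotation discussed above behaves. The mapping below is purely illustrative (it is not QZhou-Embedding's real annotation): the field maps MTEB task names to the splits used during training, and the membership checks from earlier in this thread operate on its keys.

```python
# Illustrative annotation only -- not the model's actual training data.
# `training_datasets` maps MTEB task names to the splits used in training.
training_datasets = {
    "AmazonCounterfactualClassification": ["train"],
    "Waimai": ["train"],
}

# Membership checks like the ones used in this review look at the keys:
print("Waimai" in training_datasets)                 # True
print("AskUbuntuDupQuestions" in training_datasets)  # False
```

Tasks missing from this mapping are treated as unseen during training, which is why unannotated high scores draw reviewer attention.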