Hypothesis-Z
Contributor

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jun 9, 2025

Results for GeoGPT-Research-Project/GeoEmbedding

| task_name | GeoGPT-Research-Project/GeoEmbedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large |
|---|---|---|---|
| AmazonCounterfactualClassification | 0.97 | 0.88 | 0.7 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.56 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 |
| ArguAna | 0.78 | 0.86 | 0.54 |
| AskUbuntuDupQuestions | 0.65 | 0.64 | 0.59 |
| BIOSSES | 0.84 | 0.89 | 0.85 |
| Banking77Classification | 0.92 | 0.94 | 0.75 |
| BiorxivClusteringP2P.v2 | 0.48 | 0.54 | 0.37 |
| CQADupstackGamingRetrieval | 0.65 | 0.71 | 0.59 |
| CQADupstackUnixRetrieval | 0.50 | 0.54 | 0.4 |
| ClimateFEVERHardNegatives | 0.43 | 0.31 | 0.26 |
| FEVERHardNegatives | 0.93 | 0.89 | 0.84 |
| FiQA2018 | 0.53 | 0.62 | 0.44 |
| HotpotQAHardNegatives | 0.73 | 0.87 | 0.71 |
| ImdbClassification | 0.92 | 0.95 | 0.89 |
| MTOPDomainClassification | 0.98 | 0.98 | 0.9 |
| MassiveIntentClassification | 0.86 | 0.82 | 0.6 |
| MassiveScenarioClassification | 0.90 | 0.87 | 0.7 |
| MedrxivClusteringP2P.v2 | 0.46 | 0.47 | 0.34 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 |
| MindSmallReranking | 0.32 | 0.33 | 0.3 |
| SCIDOCS | 0.22 | 0.25 | 0.17 |
| SICK-R | 0.80 | 0.83 | 0.8 |
| STS12 | 0.68 | 0.82 | 0.8 |
| STS13 | 0.82 | 0.90 | 0.82 |
| STS14 | 0.78 | 0.85 | 0.78 |
| STS15 | 0.87 | 0.90 | 0.89 |
| STS17 | 0.90 | 0.89 | 0.82 |
| STS22.v2 | 0.72 | 0.72 | 0.64 |
| STSBenchmark | 0.84 | 0.89 | 0.87 |
| SprintDuplicateQuestions | 0.94 | 0.97 | 0.93 |
| StackExchangeClustering.v2 | 0.54 | 0.92 | 0.46 |
| StackExchangeClusteringP2P.v2 | 0.40 | 0.51 | 0.39 |
| SummEvalSummarization.v2 | 0.30 | 0.38 | 0.31 |
| TRECCOVID | 0.77 | 0.86 | 0.71 |
| Touche2020Retrieval.v3 | 0.54 | 0.52 | 0.5 |
| ToxicConversationsClassification | 0.85 | 0.89 | 0.66 |
| TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 |
| TwentyNewsgroupsClustering.v2 | 0.88 | 0.57 | 0.39 |
| TwitterSemEval2015 | 0.68 | 0.79 | 0.75 |
| TwitterURLCorpus | 0.86 | 0.87 | 0.86 |
| Average | 0.70 | 0.73 | 0.62 |

Noteworthy scores include TwentyNewsgroupsClustering.v2, TweetSentimentExtractionClassification, ClimateFEVERHardNegatives, AmazonCounterfactualClassification, FEVERHardNegatives

I double-checked these against the leaderboard, where the following look concerning:

  • TwentyNewsgroupsClustering.v2: highest is 0.68
  • AmazonCounterfactualClassification: highest is ~0.93
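
The comparison behind these two bullets can be sketched in a few lines of Python. This is only an illustration, not part of the mteb tooling: the function name `flag_outliers` and the `margin` parameter are hypothetical, and the numbers are the ones quoted in this thread (the AmazonCounterfactualClassification maximum is approximate).

```python
# Scores reported in this thread for GeoGPT-Research-Project/GeoEmbedding.
model_scores = {
    "TwentyNewsgroupsClustering.v2": 0.88,
    "AmazonCounterfactualClassification": 0.97,
}

# Highest leaderboard scores quoted in this thread (the second is approximate).
leaderboard_best = {
    "TwentyNewsgroupsClustering.v2": 0.68,
    "AmazonCounterfactualClassification": 0.93,
}


def flag_outliers(scores, best, margin=0.0):
    """Return {task: gap} for tasks whose score exceeds the known best by more than margin.

    Hypothetical helper for illustration; it is not an mteb function.
    """
    flagged = {}
    for task, score in scores.items():
        if task in best and score - best[task] > margin:
            flagged[task] = round(score - best[task], 2)
    return flagged


flagged = flag_outliers(model_scores, leaderboard_best)
for task, gap in flagged.items():
    print(f"{task}: {gap:+.2f} above the previous best")
```

A gap this large on a single task does not prove contamination by itself, but it is a reasonable trigger for asking about training data, which is what happens next in the thread.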

AmazonCounterfactualClassification is partly explained by training on the dataset. @Hypothesis-Z can you help me understand TwentyNewsgroupsClustering.v2?

@Hypothesis-Z
Contributor Author

Hi @KennethEnevoldsen, thank you for double-checking in such detail.

I have checked the training datasets against the model metadata. The MTEB classification and clustering datasets used in training are:

  • Training Datasets:
    • amazoncounterfactualclassification
    • amazonpolarityclassification
    • amazonreviewsclassification
    • banking77classification
    • emotionclassification
    • massiveintentclassification
    • massivescenarioclassification
    • mtopdomainclassification
    • mtopintentclassification
    • toxicconversationsclassification
    • tweetsentimentextractionclassification
    • arxivclusteringp2p
    • arxivclusterings2s
    • biorxivclusteringp2p
    • biorxivclusterings2s
    • medrxivclusteringp2p
    • medrxivclusterings2s
    • twentynewsgroupsclustering
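
A check like this can be sketched in Python by comparing the list above with the evaluated task names. The sketch assumes that a versioned task such as `TwentyNewsgroupsClustering.v2` draws on the same dataset as its unversioned name, which is how this thread treats it; the helpers `normalize` and `trained_on` are hypothetical, and only a handful of evaluated tasks are included.

```python
import re

# Training datasets as listed above.
training_datasets = {
    "amazoncounterfactualclassification", "amazonpolarityclassification",
    "amazonreviewsclassification", "banking77classification",
    "emotionclassification", "massiveintentclassification",
    "massivescenarioclassification", "mtopdomainclassification",
    "mtopintentclassification", "toxicconversationsclassification",
    "tweetsentimentextractionclassification", "arxivclusteringp2p",
    "arxivclusterings2s", "biorxivclusteringp2p", "biorxivclusterings2s",
    "medrxivclusteringp2p", "medrxivclusterings2s",
    "twentynewsgroupsclustering",
}

# A few task names from the results table, for illustration only.
evaluated_tasks = [
    "AmazonCounterfactualClassification",
    "ArguAna",
    "BiorxivClusteringP2P.v2",
    "STS12",
    "TwentyNewsgroupsClustering.v2",
]


def normalize(task_name: str) -> str:
    """Lowercase the name and drop a trailing version suffix such as '.v2'."""
    return re.sub(r"\.v\d+$", "", task_name).lower()


def trained_on(tasks, training):
    """Return the evaluated tasks whose normalized name appears in the training set."""
    return [task for task in tasks if normalize(task) in training]


overlap = trained_on(evaluated_tasks, training_datasets)
print(overlap)
```

Exact name matching after normalization will miss renamed or restructured tasks (for example, `ArXivHierarchicalClusteringP2P` does not match `arxivclusteringp2p`), so those cases still need a manual look.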

I will open a PR to revise the model metadata, since TwentyNewsgroupsClustering should have been listed there but was not.

I've also checked the other tasks and found no further omissions.

@Hypothesis-Z
Contributor Author

Hypothesis-Z commented Jun 10, 2025

New PR: embeddings-benchmark/mteb#2802

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) June 10, 2025 20:29
@KennethEnevoldsen KennethEnevoldsen merged commit 95a8aeb into embeddings-benchmark:main Jun 10, 2025
2 checks passed
@KennethEnevoldsen
Contributor

Thanks for the update @Hypothesis-Z - will merge this in
