Describe the bug
Observed phenomena:
- NaN values appear in the evaluation result JSON files
- Inconsistent evaluation results across multiple runs: metrics occasionally spike unusually high or low (e.g., NDCG@10 sometimes exceeds 1)
Examples of Problematic Results on the Leaderboard:
- https://github.com/embeddings-benchmark/results/blob/ec2a94ac223ae625ac206523d2b88361bcfbf8e1/results/Salesforce__SFR-Embedding-Mistral/938c560d1c236aa563b2dbdf084f28ab28bccb11/SyntecRetrieval.json#L94
- https://github.com/embeddings-benchmark/results/blob/ec2a94ac223ae625ac206523d2b88361bcfbf8e1/results/Qwen__Qwen3-Embedding-8B/4e423935c619ae4df87b646a3ce949610c66241c/LEMBPasskeyRetrieval.json#L43
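For reference, one quick way to spot such entries in a downloaded copy of a results file is to walk the JSON and flag NaN scores or NDCG-like values above 1. The sketch below is my own helper (not part of MTEB) and deliberately makes no assumptions about the exact schema of the results files:

```python
# Rough helper (not part of MTEB): recursively walk a downloaded results JSON
# and report NaN scores or NDCG-like values above 1.
import json
import math
import sys


def walk(node, path="$"):
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            walk(value, f"{path}[{i}]")
    elif isinstance(node, float):
        if math.isnan(node):
            print(f"NaN at {path}")
        elif "ndcg" in path.lower() and node > 1.0:
            print(f"NDCG above 1 at {path}: {node}")


if __name__ == "__main__":
    # Usage: python scan_results.py SyntecRetrieval.json
    with open(sys.argv[1]) as f:
        walk(json.load(f))
```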
To reproduce
Reproduction Steps:
Using MTEB evaluation outputs for TwitterHjerneRetrieval (saved during runtime, including qrels and results):
Input data link: mteb_retrieval_dummy_test.json
Metric calculation code:
```python
import json

from mteb.evaluation.evaluators.RetrievalEvaluator import RetrievalEvaluator

# Load the qrels and retrieval results saved during the MTEB run.
file_path = "mteb_retrieval_dummy_test.json"
with open(file_path) as f:
    data = json.load(f)

qrels = data["qrels"]
results = data["results"]
k_values = [1, 3, 5, 10, 20, 100, 1000]

# Compute the retrieval metrics directly with MTEB's evaluator.
ndcg, _map, recall, precision, naucs = RetrievalEvaluator.evaluate(qrels, results, k_values)

print("-" * 50)
print(ndcg)
```
Possible outcomes across multiple runs (each run re-executes the Python script; note that repeated calls within the same script do not exhibit this issue):
```python
# Run 1 (exceeds 1)
{'NDCG@1': 0.98718, 'NDCG@3': 1.479, 'NDCG@5': 1.48225, 'NDCG@10': 1.4872, 'NDCG@20': 1.4872, 'NDCG@100': 1.49091, 'NDCG@1000': 1.492}

# Run 2
{'NDCG@1': 0.98718, 'NDCG@3': 0.77988, 'NDCG@5': 0.61173, 'NDCG@10': 0.45984, 'NDCG@20': 0.35702, 'NDCG@100': 0.23653, 'NDCG@1000': 0.18541}
```
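If it helps triage, a rough harness along these lines (an untested sketch; it assumes mteb_retrieval_dummy_test.json is in the working directory and that RetrievalEvaluator.evaluate returns the same five-tuple as above) can re-run the snippet in fresh interpreter processes and flag NDCG@10 values outside [0, 1]:

```python
# Untested sketch: re-run the evaluation in fresh interpreter processes,
# since the inconsistency only appears across separate script executions.
import json
import subprocess
import sys

SNIPPET = """
import json
from mteb.evaluation.evaluators.RetrievalEvaluator import RetrievalEvaluator

with open("mteb_retrieval_dummy_test.json") as f:
    data = json.load(f)

ndcg, _map, recall, precision, naucs = RetrievalEvaluator.evaluate(
    data["qrels"], data["results"], [1, 3, 5, 10, 20, 100, 1000]
)
print(json.dumps(ndcg))
"""

for i in range(10):
    # One fresh process per run; repeated calls inside a single process look stable.
    out = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True, text=True, check=True
    )
    ndcg = json.loads(out.stdout.strip().splitlines()[-1])
    flag = "  <-- out of range" if not 0.0 <= ndcg["NDCG@10"] <= 1.0 else ""
    print(f"run {i}: NDCG@10 = {ndcg['NDCG@10']}{flag}")
```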
Environment:
OS: Linux
Packages:
- mteb==1.38.39
- pytrec_eval==0.5
Additional information
Suspected issue (unconfirmed):
A potential bug in pytrec_eval's scoring calculations.
Relevant MTEB code location:
`evaluator = pytrec_eval.RelevanceEvaluator(`
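To help narrow this down, calling pytrec_eval directly on the same saved data (bypassing MTEB entirely) could show whether per-query scores already leave the [0, 1] range at that layer. A minimal sketch, assuming pytrec_eval's standard qrels/run dict format; the "ndcg_cut.<k,...>" measure string is my guess at what MTEB passes, not copied from its source:

```python
# Minimal sketch: feed the saved qrels/results straight to pytrec_eval and
# flag per-query NDCG values that are NaN or greater than 1.
import json
import math

import pytrec_eval

with open("mteb_retrieval_dummy_test.json") as f:
    data = json.load(f)

# pytrec_eval expects integer relevance judgements; cast defensively in case
# the saved JSON stored them as floats.
qrels = {q: {d: int(r) for d, r in docs.items()} for q, docs in data["qrels"].items()}
results = data["results"]

k_values = [1, 3, 5, 10, 20, 100, 1000]
ndcg_string = "ndcg_cut." + ",".join(str(k) for k in k_values)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {ndcg_string})
scores = evaluator.evaluate(results)

for qid, measures in scores.items():
    for k in k_values:
        value = measures[f"ndcg_cut_{k}"]
        if math.isnan(value) or not 0.0 <= value <= 1.0:
            print(f"suspicious score for query {qid}: ndcg_cut_{k} = {value}")
```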
Are you interested in contributing a fix for this bug?
No