Evaluation of Retrieval Tasks: Occasional Occurrences of NaN and Result Inconsistency in Repeated Runs #3030

@YanshekWoo

Describe the bug

Observed Phenomenon:

  • NaN values appear in the evaluation result JSON files
  • Inconsistent evaluation results across multiple runs: metrics occasionally spike unusually high or low (e.g., NDCG@10 sometimes exceeds 1, which a normalized metric should never do)

Examples of Problematic Results on the Leaderboard:

To reproduce

Reproduction Steps:

The input is the MTEB evaluation output for TwitterHjerneRetrieval, saved during runtime and containing both qrels and results.
Input data link: mteb_retrieval_dummy_test.json

Statistical calculation code:

import json

from mteb.evaluation.evaluators.RetrievalEvaluator import RetrievalEvaluator

# Load the qrels and retrieval results saved during the MTEB run.
file_path = "mteb_retrieval_dummy_test.json"
with open(file_path) as f:
    data = json.load(f)
qrels = data["qrels"]
results = data["results"]

# Recompute the retrieval metrics at the usual cutoffs.
k_values = [1, 3, 5, 10, 20, 100, 1000]
ndcg, _map, recall, precision, naucs = RetrievalEvaluator.evaluate(qrels, results, k_values)

print("-" * 50)
print(ndcg)

Possible outcomes across multiple runs, where each run re-executes the Python script (repeated calls within the same script do not exhibit this issue):

# Run 1 (exceeds 1)
{'NDCG@1': 0.98718, 'NDCG@3': 1.479, 'NDCG@5': 1.48225, 'NDCG@10': 1.4872, 'NDCG@20': 1.4872, 'NDCG@100': 1.49091, 'NDCG@1000': 1.492}

# Run 2
{'NDCG@1': 0.98718, 'NDCG@3': 0.77988, 'NDCG@5': 0.61173, 'NDCG@10': 0.45984, 'NDCG@20': 0.35702, 'NDCG@100': 0.23653, 'NDCG@1000': 0.18541}
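
Since NDCG is normalized, any value that is NaN or outside [0, 1] indicates a broken run. A minimal sketch of a check that flags such values in a metric dictionary follows; the helper name `find_invalid_metrics` is hypothetical and not part of MTEB or pytrec_eval.

```python
import math


def find_invalid_metrics(metrics):
    """Return the entries that are NaN or fall outside the range [0, 1]."""
    return {
        name: value
        for name, value in metrics.items()
        if math.isnan(value) or not 0.0 <= value <= 1.0
    }


# Values copied from Run 1 above.
run_1 = {
    "NDCG@1": 0.98718, "NDCG@3": 1.479, "NDCG@5": 1.48225, "NDCG@10": 1.4872,
    "NDCG@20": 1.4872, "NDCG@100": 1.49091, "NDCG@1000": 1.492,
}

# Print the names of the metrics that cannot be valid NDCG values.
print(sorted(find_invalid_metrics(run_1)))
```

Calling this on each run's `ndcg` dictionary makes it easy to count how often the problem appears across repeated executions.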

Environment:

OS: Linux
Packages:

  • mteb==1.38.39
  • pytrec_eval==0.5

Additional information

Suspected Issue (Unconfirmed):

Potential bugs in pytrec_eval scoring calculations.
Relevant MTEB code location:

evaluator = pytrec_eval.RelevanceEvaluator(
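
To help narrow down whether pytrec_eval is the source, the same qrels and results could be scored with an independent reference. The sketch below is a pure-Python NDCG@k with linear gain and a log2 discount, written for cross-checking only. The function name `ndcg_at_k` is hypothetical, and the handling of tied scores and of queries missing from the results is an assumption that may differ from trec_eval, so small differences from pytrec_eval are expected. Its output always stays within [0, 1].

```python
import math


def ndcg_at_k(qrels, results, k):
    """Mean NDCG@k over the queries in qrels (linear gain, log2 discount)."""
    per_query = []
    for qid, rels in qrels.items():
        # Rank retrieved documents by score, highest first, and keep the top k.
        ranked = sorted(results.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]
        dcg = sum(rels.get(doc, 0) / math.log2(rank + 2) for rank, (doc, _) in enumerate(ranked))
        # Ideal ranking: relevant documents sorted by relevance grade, highest first.
        ideal = sorted((g for g in rels.values() if g > 0), reverse=True)[:k]
        idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Tiny made-up example, not taken from the attached data file.
toy_qrels = {"q1": {"d1": 1, "d2": 0}}
toy_results = {"q1": {"d1": 0.1, "d2": 0.9}}
print(ndcg_at_k(toy_qrels, toy_results, 10))
```

If this reference gives stable values across repeated executions while `RetrievalEvaluator.evaluate` does not, that would point toward the pytrec_eval call path rather than the input data.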

Are you interested to contribute a fix for this bug?

No
