Describe the bug
Observed phenomena:
- NaN values appear in the evaluation result JSON files
- Inconsistent evaluation results across multiple runs: metrics occasionally spike unusually high or low (e.g., NDCG@10 sometimes exceeds 1)
Examples of Problematic Results on the Leaderboard:
- https://github.com/embeddings-benchmark/results/blob/ec2a94ac223ae625ac206523d2b88361bcfbf8e1/results/Salesforce__SFR-Embedding-Mistral/938c560d1c236aa563b2dbdf084f28ab28bccb11/SyntecRetrieval.json#L94
- https://github.com/embeddings-benchmark/results/blob/ec2a94ac223ae625ac206523d2b88361bcfbf8e1/results/Qwen__Qwen3-Embedding-8B/4e423935c619ae4df87b646a3ce949610c66241c/LEMBPasskeyRetrieval.json#L43
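For reference, one quick way to spot such entries in a downloaded copy of a results file is to walk the JSON and flag NaN scores or NDCG-like values above 1. The sketch below is my own helper (not part of MTEB) and deliberately makes no assumptions about the exact schema of the results files:

```python
# Rough helper (not part of MTEB): recursively walk a downloaded results JSON
# and report NaN scores or NDCG-like values above 1.
import json
import math
import sys


def walk(node, path="$"):
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            walk(value, f"{path}[{i}]")
    elif isinstance(node, float):
        if math.isnan(node):
            print(f"NaN at {path}")
        elif "ndcg" in path.lower() and node > 1.0:
            print(f"NDCG above 1 at {path}: {node}")


if __name__ == "__main__":
    # Usage: python scan_results.py SyntecRetrieval.json
    with open(sys.argv[1]) as f:
        walk(json.load(f))
```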
To reproduce
Reproduction Steps:
Using MTEB evaluation outputs for TwitterHjerneRetrieval (saved during runtime, including qrels and results):
Input data link: mteb_retrieval_dummy_test.json
Metric calculation code:
```python
import json

from mteb.evaluation.evaluators.RetrievalEvaluator import RetrievalEvaluator

# Load the qrels and retrieval results saved during the MTEB run.
file_path = "mteb_retrieval_dummy_test.json"
with open(file_path) as f:
    data = json.load(f)

qrels = data["qrels"]
results = data["results"]
k_values = [1, 3, 5, 10, 20, 100, 1000]

# Compute the retrieval metrics directly with MTEB's evaluator.
ndcg, _map, recall, precision, naucs = RetrievalEvaluator.evaluate(qrels, results, k_values)

print("-" * 50)
print(ndcg)
```
Possible outcomes across multiple runs (each run re-executes the Python script; note that repeated calls within the same script do not exhibit this issue):
```python
# Run 1 (exceeds 1)
{'NDCG@1': 0.98718, 'NDCG@3': 1.479, 'NDCG@5': 1.48225, 'NDCG@10': 1.4872, 'NDCG@20': 1.4872, 'NDCG@100': 1.49091, 'NDCG@1000': 1.492}

# Run 2
{'NDCG@1': 0.98718, 'NDCG@3': 0.77988, 'NDCG@5': 0.61173, 'NDCG@10': 0.45984, 'NDCG@20': 0.35702, 'NDCG@100': 0.23653, 'NDCG@1000': 0.18541}
```
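If it helps triage, a rough harness along these lines (an untested sketch; it assumes mteb_retrieval_dummy_test.json is in the working directory and that RetrievalEvaluator.evaluate returns the same five-tuple as above) can re-run the snippet in fresh interpreter processes and flag NDCG@10 values outside [0, 1]:

```python
# Untested sketch: re-run the evaluation in fresh interpreter processes,
# since the inconsistency only appears across separate script executions.
import json
import subprocess
import sys

SNIPPET = """
import json
from mteb.evaluation.evaluators.RetrievalEvaluator import RetrievalEvaluator

with open("mteb_retrieval_dummy_test.json") as f:
    data = json.load(f)

ndcg, _map, recall, precision, naucs = RetrievalEvaluator.evaluate(
    data["qrels"], data["results"], [1, 3, 5, 10, 20, 100, 1000]
)
print(json.dumps(ndcg))
"""

for i in range(10):
    # One fresh process per run; repeated calls inside a single process look stable.
    out = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True, text=True, check=True
    )
    ndcg = json.loads(out.stdout.strip().splitlines()[-1])
    flag = "  <-- out of range" if not 0.0 <= ndcg["NDCG@10"] <= 1.0 else ""
    print(f"run {i}: NDCG@10 = {ndcg['NDCG@10']}{flag}")
```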
Environment:
OS: Linux
Packages:
- mteb==1.38.39
- pytrec_eval==0.5
Additional information
Suspected issue (unconfirmed):
A potential bug in pytrec_eval's scoring calculations.
Relevant MTEB code location:
`evaluator = pytrec_eval.RelevanceEvaluator(`
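To help narrow this down, calling pytrec_eval directly on the same saved data (bypassing MTEB entirely) could show whether per-query scores already leave the [0, 1] range at that layer. A minimal sketch, assuming pytrec_eval's standard qrels/run dict format; the "ndcg_cut.<k,...>" measure string is my guess at what MTEB passes, not copied from its source:

```python
# Minimal sketch: feed the saved qrels/results straight to pytrec_eval and
# flag per-query NDCG values that are NaN or greater than 1.
import json
import math

import pytrec_eval

with open("mteb_retrieval_dummy_test.json") as f:
    data = json.load(f)

# pytrec_eval expects integer relevance judgements; cast defensively in case
# the saved JSON stored them as floats.
qrels = {q: {d: int(r) for d, r in docs.items()} for q, docs in data["qrels"].items()}
results = data["results"]

k_values = [1, 3, 5, 10, 20, 100, 1000]
ndcg_string = "ndcg_cut." + ",".join(str(k) for k in k_values)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {ndcg_string})
scores = evaluator.evaluate(results)

for qid, measures in scores.items():
    for k in k_values:
        value = measures[f"ndcg_cut_{k}"]
        if math.isnan(value) or not 0.0 <= value <= 1.0:
            print(f"suspicious score for query {qid}: ndcg_cut_{k} = {value}")
```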
Are you interested in contributing a fix for this bug?
No