Add HumanEvalRetrieval task #3022
Conversation
0822ee1 to e6d4f65 Compare
- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure

- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove incorrect revision field
- Reference HuggingFace dataset page instead of arXiv

- Add revision hash ed1f48a for reproducibility

- Add date field for metadata completeness
- Add bibtex_citation field (empty string), required for TaskMetadata validation to pass; should resolve the PR test failure

- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use the metadata revision hash in dataset loading for reproducibility
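For reference, a minimal sketch of the pinned loading described in that last commit. The repository path and short revision hash come from earlier commits in this PR; the config names ("corpus", "queries", "qrels") are assumptions about a BEIR-style layout on the Hub, and the merged task uses the full commit hash rather than the shortened one shown here:

from datasets import load_dataset

REPO = "embedding-benchmark/HumanEval"
REVISION = "ed1f48a"  # shortened pin; the final metadata uses the full 40-character hash

# Config names below are assumed; the actual dataset may expose different ones.
corpus = load_dataset(REPO, "corpus", revision=REVISION)
queries = load_dataset(REPO, "queries", revision=REVISION)
qrels = load_dataset(REPO, "qrels", revision=REVISION)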
0f4650d to cb6e750 Compare
Changed query_id/corpus_id to query-id/corpus-id to match the actual dataset format.
cb6e750 to 3bceb9e Compare
Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.
- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility
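Putting the last two commits together, a rough sketch of how load_data might look on the task class. The column names ("_id", "text", "query-id", "corpus-id", "score"), the config names, and the single "test" split are assumptions about the dataset layout, not the exact implementation in this PR:

from datasets import load_dataset

def load_data(self, **kwargs):
    if self.data_loaded:
        return

    # v2.0 style: read path/revision from self.metadata.dataset, not self.metadata_dict.
    path = self.metadata.dataset["path"]
    revision = self.metadata.dataset["revision"]

    corpus_rows = load_dataset(path, "corpus", revision=revision)["corpus"]
    query_rows = load_dataset(path, "queries", revision=revision)["queries"]
    qrel_rows = load_dataset(path, "qrels", revision=revision)["test"]

    split = "test"
    # MTEB retrieval tasks expect corpus/queries/qrels organized per split.
    self.corpus = {split: {row["_id"]: {"text": row["text"]} for row in corpus_rows}}
    self.queries = {split: {row["_id"]: row["text"] for row in query_rows}}

    # pytrec_eval requires integer relevance scores.
    self.relevant_docs = {split: {}}
    for row in qrel_rows:
        qid, cid = row["query-id"], row["corpus-id"]
        self.relevant_docs[split].setdefault(qid, {})[cid] = int(row["score"])

    self.data_loaded = True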
Can I ask you to also compute the descriptive stats using:
import mteb

task = mteb.get_task(name)  # e.g. name = "HumanEvalRetrieval"
task.calculate_metadata_metrics()  # creates the descriptive-stats file in the correct place
I also suggested a few updates to the metadata. Generally, it should be possible for the user to read the description and get a fair understanding of what a sample looks like (what is the query, what does the corpus contain).
7c6c66e to ac13d9f Compare
- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for the original paper
- Update to full commit hash revision
- Add python-Code language tag for the programming language
- Explain the retrieval task formulation clearly
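To make those metadata changes concrete, here is a hedged illustration of the fields this commit touches. The values are assumptions pieced together from the discussion above: the full revision hash is elided, the description wording is mine rather than the merged one, and the query-to-code direction is inferred from the dataset name and the reviewer's request.

# Hypothetical field values, shown standalone rather than inside the real
# TaskMetadata call so nothing here is mistaken for the merged task definition.
dataset = {
    "path": "embedding-benchmark/HumanEval",
    "revision": "<full 40-character commit hash>",  # pinned for reproducibility
}
eval_langs = ["python-Code"]  # programming-language tag added in this commit
description = (
    "Code retrieval task built on HumanEval: each query is a natural-language "
    "problem description/docstring, and the corpus contains the corresponding "
    "Python solutions; the goal is to retrieve the solution matching the query."
)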
ac13d9f to 124de9e Compare
- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation
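As an illustration of the formatting the bibtexparser check expects. The entry itself is for the well-known HumanEval paper (Chen et al., 2021, arXiv:2107.03374), with the author list truncated here; the exact entry in the task file may differ.

# Lowercase field names in alphabetical order, consistent indentation, and a
# trailing comma after every field: the formatting this commit normalizes.
bibtex_citation = r"""
@article{chen2021evaluating,
  author = {Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and others},
  journal = {arXiv preprint arXiv:2107.03374},
  title = {Evaluating Large Language Models Trained on Code},
  year = {2021},
}
"""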
Looks good to me! Thanks for the PR
#3014
HumanEval dataset for code retrieval tasks
158 queries, 158 documents
I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command; a Python equivalent is sketched after this checklist.
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
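A hedged Python equivalent of the CLI invocation mentioned in the checklist; the model name is a placeholder for illustration, not necessarily one of the models actually run for this PR.

import mteb

task = mteb.get_task("HumanEvalRetrieval")
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model

evaluation = mteb.MTEB(tasks=[task])
results = evaluation.run(model, output_folder="results")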