Add HumanEvalRetrieval task #3022
Conversation
0822ee1 to e6d4f65 Compare
- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure

- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove incorrect revision field
- Reference HuggingFace dataset page instead of arXiv

- Add revision hash ed1f48a for reproducibility

- Add date field for metadata completeness
- Add bibtex_citation field (empty string), required for TaskMetadata validation to pass; should resolve the PR test failure

- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use the metadata revision hash in dataset loading for reproducibility
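For reference, a minimal sketch of the pinned loading described in that last commit. The repository path and short revision hash come from earlier commits in this PR; the config names ("corpus", "queries", "qrels") are assumptions about a BEIR-style layout on the Hub, and the merged task uses the full commit hash rather than the shortened one shown here:

from datasets import load_dataset

REPO = "embedding-benchmark/HumanEval"
REVISION = "ed1f48a"  # shortened pin; the final metadata uses the full 40-character hash

# Config names below are assumed; the actual dataset may expose different ones.
corpus = load_dataset(REPO, "corpus", revision=REVISION)
queries = load_dataset(REPO, "queries", revision=REVISION)
qrels = load_dataset(REPO, "qrels", revision=REVISION)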
0f4650d to cb6e750 Compare
Changed query_id/corpus_id to query-id/corpus-id to match the actual dataset format.
cb6e750 to 3bceb9e Compare
Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.
- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility
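Putting the last two commits together, a rough sketch of how load_data might look on the task class. The column names ("_id", "text", "query-id", "corpus-id", "score"), the config names, and the single "test" split are assumptions about the dataset layout, not the exact implementation in this PR:

from datasets import load_dataset

def load_data(self, **kwargs):
    if self.data_loaded:
        return

    # v2.0 style: read path/revision from self.metadata.dataset, not self.metadata_dict.
    path = self.metadata.dataset["path"]
    revision = self.metadata.dataset["revision"]

    corpus_rows = load_dataset(path, "corpus", revision=revision)["corpus"]
    query_rows = load_dataset(path, "queries", revision=revision)["queries"]
    qrel_rows = load_dataset(path, "qrels", revision=revision)["test"]

    split = "test"
    # MTEB retrieval tasks expect corpus/queries/qrels organized per split.
    self.corpus = {split: {row["_id"]: {"text": row["text"]} for row in corpus_rows}}
    self.queries = {split: {row["_id"]: row["text"] for row in query_rows}}

    # pytrec_eval requires integer relevance scores.
    self.relevant_docs = {split: {}}
    for row in qrel_rows:
        qid, cid = row["query-id"], row["corpus-id"]
        self.relevant_docs[split].setdefault(qid, {})[cid] = int(row["score"])

    self.data_loaded = True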
Can I ask you to also compute the descriptive stats using:
import mteb

task = mteb.get_task(name)  # e.g. name = "HumanEvalRetrieval"
task.calculate_metadata_metrics()  # creates the descriptive-stats file in the correct place
I also suggested a few updates to the metadata. Generally, it should be possible for the user to read the description and get a fair understanding of what a sample looks like (what is the query, what does the corpus contain).
7c6c66e to ac13d9f Compare
- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for the original paper
- Update to full commit hash revision
- Add python-Code language tag for the programming language
- Explain the retrieval task formulation clearly
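To make those metadata changes concrete, here is a hedged illustration of the fields this commit touches. The values are assumptions pieced together from the discussion above: the full revision hash is elided, the description wording is mine rather than the merged one, and the query-to-code direction is inferred from the dataset name and the reviewer's request.

# Hypothetical field values, shown standalone rather than inside the real
# TaskMetadata call so nothing here is mistaken for the merged task definition.
dataset = {
    "path": "embedding-benchmark/HumanEval",
    "revision": "<full 40-character commit hash>",  # pinned for reproducibility
}
eval_langs = ["python-Code"]  # programming-language tag added in this commit
description = (
    "Code retrieval task built on HumanEval: each query is a natural-language "
    "problem description/docstring, and the corpus contains the corresponding "
    "Python solutions; the goal is to retrieve the solution matching the query."
)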
ac13d9f to 124de9e Compare
- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation
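As an illustration of the formatting the bibtexparser check expects. The entry itself is for the well-known HumanEval paper (Chen et al., 2021, arXiv:2107.03374), with the author list truncated here; the exact entry in the task file may differ.

# Lowercase field names in alphabetical order, consistent indentation, and a
# trailing comma after every field: the formatting this commit normalizes.
bibtex_citation = r"""
@article{chen2021evaluating,
  author = {Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and others},
  journal = {arXiv preprint arXiv:2107.03374},
  title = {Evaluating Large Language Models Trained on Code},
  year = {2021},
}
"""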
Looks good to me! Thanks for the PR
#3014
HumanEval dataset for code retrieval tasks
158 queries, 158 documents
I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command; a Python equivalent is sketched after this checklist.
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
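A hedged Python equivalent of the CLI invocation mentioned in the checklist; the model name is a placeholder for illustration, not necessarily one of the models actually run for this PR.

import mteb

task = mteb.get_task("HumanEvalRetrieval")
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model

evaluation = mteb.MTEB(tasks=[task])
results = evaluation.run(model, output_folder="results")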