dataset: add BarExamQA dataset #2916

Merged

Conversation

abdurrahmanbutler
Contributor

@abdurrahmanbutler abdurrahmanbutler commented Jul 21, 2025

Hi,
I’m submitting this pull request to add BarExamQA to MTEB.

BarExamQA is a dataset created by RegLab to evaluate models on the retrieval of relevant legal provisions.

BarExamQA contains over 100 questions taken from the bar exams of states around the US. For each question, law students manually identified the most relevant legal provisions.

We would like to improve the coverage of legal domain tasks on MTEB and we believe this dataset will contribute to increasing the diversity and difficulty of MTEB.

This pull request is being submitted courtesy of Isaacus, a legal AI research company.

You may find the original dataset here:
https://huggingface.co/datasets/reglab/barexam_qa

Our version of the dataset, converted to the MTEB information retrieval dataset format, is available here:
https://huggingface.co/datasets/isaacus/mteb-barexam-qa
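
For readers unfamiliar with retrieval datasets, here is a minimal sketch of the kind of consistency check such a conversion invites. It assumes the common BEIR-style layout of a corpus, a set of queries, and relevance judgments (qrels); this PR does not confirm that the converted dataset uses exactly this layout. All IDs and texts are made-up placeholders, and `validate` is a hypothetical helper, not part of mteb.

```python
# Toy illustration of a BEIR-style retrieval layout.
# Every ID and text below is a hypothetical placeholder, not BarExamQA data.
corpus = {
    "p1": {"title": "", "text": "A contract requires offer, acceptance, and consideration."},
    "p2": {"title": "", "text": "Hearsay is an out-of-court statement offered for its truth."},
}
queries = {
    "q1": "Is a promise without consideration enforceable?",
}
# qrels maps each query ID to {passage ID: relevance score}.
qrels = {"q1": {"p1": 1}}


def validate(corpus, queries, qrels):
    """Return a list of problems; an empty list means the three parts agree."""
    problems = []
    for qid, rels in qrels.items():
        if qid not in queries:
            problems.append(f"unknown query id: {qid}")
        for pid in rels:
            if pid not in corpus:
                problems.append(f"unknown passage id: {pid}")
    for qid in queries:
        # Every query should have at least one judged passage.
        if not qrels.get(qid):
            problems.append(f"query without relevant passage: {qid}")
    return problems


print(validate(corpus, queries, qrels))
```

A check like this is cheap to run before evaluating any model, since a dangling ID in the qrels would otherwise only show up as unexpectedly low scores.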

Checklist

  • I have outlined why this dataset is filling an existing gap in mteb
  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)

@abdurrahmanbutler abdurrahmanbutler changed the title from "Dataset: Bar" to "Dataset: add BarExamQA dataset" Jul 21, 2025

@umarbutler umarbutler left a comment

I can confirm that I have reviewed and approve this PR on behalf of Isaacus.

@Samoed
Member

Samoed commented Jul 21, 2025

I left a comment about the model results in the results repo PR: embeddings-benchmark/results#240 (comment)

@Samoed Samoed changed the title from "Dataset: add BarExamQA dataset" to "dataset: add BarExamQA dataset" Jul 21, 2025
@Samoed Samoed merged commit 1dcc6dc into embeddings-benchmark:main Jul 21, 2025
9 checks passed
@abdurrahmanbutler
Contributor Author

LGTM! Thanks @Samoed
