dataset: add BillSum datasets #2943
I can confirm that I have reviewed and approve this PR on behalf of Isaacus.
Hi, thanks for the PR! It generally looks good. However, the tasks are not imported, which means they will not be fetchable with mteb.get_task("{taskname}").
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Thanks for the feedback. I've approved the changes @KennethEnevoldsen
@abdurrahmanbutler you needed to add imports to the
Hi,
I’m submitting this pull request to push the Californian and US splits of the BillSum dataset to MTEB.
BillSum is a dataset created by FiscalNote for the purposes of training and evaluating models capable of summarizing federal and state legislation.
We have reframed the problem in terms of the retrieval of bills based on their summaries, making our reformatted datasets suitable for the evaluation of legal information retrieval models.
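As a rough illustration of that reframing (a minimal sketch, not the exact script used for this PR), each summary becomes a query and the bill it summarizes becomes the single relevant document. The field names `text` and `summary` are assumed to match the original dataset, and the helper name `build_retrieval_task` and the `doc-`/`query-` ID scheme are hypothetical:

```python
def build_retrieval_task(rows):
    """Turn (bill text, summary) pairs into corpus, queries, and relevance labels.

    `rows` is an iterable of dicts with "text" and "summary" keys
    (assumed field names; adjust if the source data differs).
    """
    corpus = {}   # document ID -> full bill text
    queries = {}  # query ID -> bill summary
    qrels = {}    # query ID -> {relevant document ID: relevance score}

    for i, row in enumerate(rows):
        doc_id = f"doc-{i}"      # hypothetical ID scheme
        query_id = f"query-{i}"  # hypothetical ID scheme
        corpus[doc_id] = row["text"]
        queries[query_id] = row["summary"]
        # Each summary has exactly one relevant bill: the one it was written for.
        qrels[query_id] = {doc_id: 1}

    return corpus, queries, qrels
```

With this layout, a retrieval model is scored on whether it ranks the matching bill first for each summary.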
We want to improve the coverage of legal domain tasks on MTEB and we believe this dataset will contribute to increasing the diversity and difficulty of MTEB.
This pull request is being submitted courtesy of Isaacus, a legal AI research company.
You may find the original dataset here:
https://huggingface.co/datasets/FiscalNote/billsum
Note that the original dataset contained a large number of examples in both the federal US and Californian test splits, so we have reduced each split to 500 randomly selected examples.
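For reference, the downsampling can be sketched as follows. This is only an illustration under assumptions: the helper name `subsample` and the seed value are hypothetical and are not taken from the script used to build the uploaded splits.

```python
import random


def subsample(rows, k=500, seed=42):
    """Return k rows chosen uniformly at random without replacement.

    The seed is an assumed value, fixed here only so the selection is
    reproducible. Splits that already have k or fewer rows are returned whole.
    """
    rows = list(rows)
    if len(rows) <= k:
        return rows
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(rows, k)
```

Applying the same function to the federal US and Californian test splits separately yields two independent 500-example subsets.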
Checklist