Closed
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
Description
We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036)
We can naturally fix these as they arise, but it would be ideal to have a test that checks, for each dataset, whether it is of "high quality". These checks could e.g. include:
- Checking that there are no empty documents
- Checking that the task contains no duplicates
- Checking leakage between train and test sets
- Optionally, we could add the existing computed metrics here as well (e.g. avg. length)
We can then write a file for a specific dataset / revision to compute these metrics.
Other tests, such as checking whether the language matches, could also be added in the future.
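The checks listed above could look roughly like the following. This is only a minimal sketch: it assumes a dataset can be reduced to two plain lists of strings (train and test), it only catches exact-match duplicates and leakage, and the function name `dataset_quality_report` and its output keys are hypothetical, not part of the existing codebase.

```python
from collections import Counter


def count_duplicates(docs: list[str]) -> int:
    """Number of documents that are exact repeats of an earlier document."""
    return sum(count - 1 for count in Counter(docs).values())


def dataset_quality_report(train: list[str], test: list[str]) -> dict[str, float]:
    """Hypothetical helper: simple quality statistics for one train/test split."""
    all_docs = train + test
    train_set = set(train)
    return {
        # Documents that are empty or contain only whitespace.
        "num_empty": sum(1 for doc in all_docs if not doc.strip()),
        # Exact duplicates, counted within each split separately.
        "num_duplicates": count_duplicates(train) + count_duplicates(test),
        # Test documents that also appear verbatim in the train split.
        "num_leaked": sum(1 for doc in test if doc in train_set),
        # Average document length in characters (0.0 for an empty dataset).
        "avg_length": sum(len(doc) for doc in all_docs) / len(all_docs) if all_docs else 0.0,
    }


report = dataset_quality_report(
    train=["a cat", "a dog", ""],
    test=["a dog", "a bird", "a bird"],
)
```

A per-dataset test could then assert that `num_empty`, `num_duplicates`, and `num_leaked` are all zero, and the report itself could be what gets written to a file per dataset / revision.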
isaac-chung