Implements check on existing and new datasets

We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036).

We can naturally fix these as they arise, but it would be ideal to have a test that checks whether each dataset is of "high quality". These checks could include, for example:

1) Checking that there are no empty documents
2) Checking that the task contains no duplicates
3) Checking leakage between train and test sets
4) Optionally, we could add the existing computed metrics here as well (e.g. avg. length)
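
As a rough illustration of the checks listed above, here is a minimal sketch. It assumes each split is a plain list of document strings; the function name `check_dataset_quality` and the report keys are hypothetical and not part of the existing codebase. Duplicates and leakage are detected by exact string match only, which is a simplification.

```python
from collections import Counter


def check_dataset_quality(train, test):
    """Compute simple quality metrics for two splits.

    Assumption: `train` and `test` are lists of document strings.
    """

    def summarize(docs):
        counts = Counter(docs)
        return {
            "num_documents": len(docs),
            # 1) empty (or whitespace-only) documents
            "num_empty": sum(1 for d in docs if not d.strip()),
            # 2) exact duplicates: every repeat beyond the first occurrence
            "num_duplicates": sum(c - 1 for c in counts.values() if c > 1),
            # 4) existing descriptive metric: average length in characters
            "avg_length": sum(len(d) for d in docs) / len(docs) if docs else 0.0,
        }

    report = {"train": summarize(train), "test": summarize(test)}
    # 3) leakage: distinct documents that appear in both splits
    report["num_leaked"] = len(set(train) & set(test))
    return report
```

A test could then assert that `num_empty`, `num_duplicates`, and `num_leaked` are all zero for a given dataset, or report the offending counts otherwise.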

We can then write a file for a specific dataset / revision that records these computed metrics.
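
One possible way to store such a file, sketched under assumptions: the directory layout (`dataset_metrics/<dataset>/<revision>.json`), the JSON format, and the helper name `write_metrics` are all hypothetical choices made for illustration.

```python
import json
from pathlib import Path


def write_metrics(metrics, dataset_name, revision, out_dir="dataset_metrics"):
    """Write a metrics dict to <out_dir>/<dataset_name>/<revision>.json.

    Assumption: `metrics` is JSON-serializable (e.g. a dict of numbers).
    """
    path = Path(out_dir) / dataset_name / f"{revision}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the file stable across runs, so diffs stay readable
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return path
```

Keying the file on the revision means a dataset update produces a new file rather than silently overwriting the old metrics.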

Other tests, such as checking whether the language matches, could also be added in the future.

Implements check on existing and new datasets #1049
