Evaluation suite #302
Conversation
Hi @mathemakitten, this looks really good and clean! I left some comments and maybe @lhoestq also has some opinions
Force-pushed from 99569e2 to 0495810 (Compare)
Thanks @lvwerra, I've added all those suggestions! If the API looks good to you now, I'll add some tests assuming this format :)
For your information, there is already evaluation metadata in the dataset cards that contains the same info as your JSON file. See the train_eval_index content of glue, for example: https://huggingface.co/datasets/glue/blob/main/README.md#L29, or even the imdb one, which also contains the metrics to use: https://huggingface.co/datasets/imdb/blob/main/README.md#L32
I think it would be nice if we could reuse train_eval_index so we don't end up with duplicate info scattered across the Hub. What do you think?
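For reference, card metadata like this can be read programmatically. The sketch below assumes the huggingface_hub DatasetCard API and the train-eval-index card key; it is illustrative and not code from this PR:

```python
from huggingface_hub import DatasetCard

# Load the dataset card (README.md) of a dataset repo and parse its YAML metadata.
card = DatasetCard.load("imdb")
metadata = card.data.to_dict()

# Evaluation metadata, when present, lives under the "train-eval-index" key.
for entry in metadata.get("train-eval-index", []):
    print(entry.get("task"), entry.get("splits"), entry.get("metrics"))
```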
@lhoestq maybe we can use the …
I see! Users could overwrite it with kwargs in …
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Thanks for the pointer re: …
Cool! Overall I'd be in favor of not introducing new files or metadata if we can use something that exists already - this is less confusing for users.
Could there be a way to pass …
OK, I had a look at how …
Oh ok, I see! Let's use new config files then ^^' You can use a Space to store it and have a custom UI to present it. In a dataset repo you can only do markdown, and it would be weird to show a preview of a config file. Note that enriching train_eval_index a bit may also be possible cc @lewtun
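For illustration, a config stored in a Space repo could be fetched with huggingface_hub as sketched below; the repo id and filename are hypothetical, and this is not part of this PR:

```python
from huggingface_hub import hf_hub_download

# Fetch a (hypothetical) suite config stored in a Space repo.
config_path = hf_hub_download(
    repo_id="lvwerra/sentiment_suite",  # hypothetical Space
    filename="suite_config.json",       # hypothetical filename
    repo_type="space",
)

with open(config_path) as f:
    print(f.read())
```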
Okok, will do! That makes sense re: markdown and config, thanks @lhoestq :) In that case I'll keep the EvaluationSuite config. I appreciate you bringing up the …
Following up on last week's discussion, here are some ideas for how we could frame the EvaluationSuite. In `evaluate`:

```python
@dataclass
class SubTask:
    model_or_pipeline: Optional[Union[str, Model]] = None
    data: Optional[Union[str, Dataset, Array]] = None
    subset: Optional[str] = None
    split: Optional[str] = None
    data_preprocessor: Optional[Callable] = None
```

On the Hub or locally (…):

```python
preprocessor = lambda x: x["text"].lower()

suite = [
    SubTask(
        data="imdb",
        split="test",
        data_preprocessor=preprocessor
    ),
    SubTask(
        data="sst2",
        split="validation",
        data_preprocessor=preprocessor
    )
]
```

User (…):

```python
from evaluate import Suite

suite = Suite.load("lvwerra/sentiment_suite")  # from hub
suite.run("lvwerra/distilbert-imdb")
```

We could also support loading from …
Thanks so much for this, @lvwerra! A preprocessor for the data that you can pass in is a great idea. My thought was that it could work something like this: …
Does that make sense? That way, the data would only need to be processed once for all the ways it can be used. Regarding this bit: …
I'm supportive of not using JSON, especially if not using it allows us to use a data_preprocessor. =)
Cool! The "EvaluationSuite" sounds clear to me, and so does how to define it, nice! ^^ Other name ideas for "SubTask" (since "task" is already used in the HF ecosystem for other things): maybe something like "EvaluationData" or "DataSource". Could you also give more details on why there's "model_or_pipeline" in what you call "SubTask"?
Awesome! And the SubTask/Task/EvaluationJob would receive a …
Thanks @lvwerra, this is super helpful for understanding your proposal!
Question: where should data preprocessing live if it's not being used with an evaluation suite, but rather just as an extra preprocessing step before the data gets fed into …
Thanks. =) Yes, we discussed this earlier, and it's good by me. I do want to clarify, though: the .map function is something we would have in place within our code, which operates on a …
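For illustration, applying such a preprocessor once with datasets' .map could look like the sketch below; the dataset and preprocessor are examples, not code from this PR:

```python
from datasets import load_dataset

# Example preprocessor: lowercase the text column before evaluation.
def preprocessor(example):
    example["text"] = example["text"].lower()
    return example

# Load the split once and apply the preprocessing a single time;
# the resulting dataset can then be reused by every evaluation in the suite.
dataset = load_dataset("imdb", split="test")
dataset = dataset.map(preprocessor)
```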
Thanks for the feedback and your thoughts. Answering the open questions: …
Superseded by #337!
Requires #301 to be merged first (lots of these diffs on data loading in the evaluator code will disappear afterward). You can run the GLUE suite of tasks like this, for example: …
where the JSON file contains metadata on the dataset and args to pass to the evaluator. The example config for GLUE is here.
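As an illustration only, such a config might carry entries like the sketch below; the field names and values are hypothetical and not the actual schema from this PR:

```python
import json

# Hypothetical shape of a suite config; the field names are illustrative only.
glue_suite_config = [
    {
        "data": "glue",
        "subset": "sst2",
        "split": "validation",
        "args_for_task": {"input_column": "sentence", "label_column": "label"},
    },
    {
        "data": "glue",
        "subset": "cola",
        "split": "validation",
        "args_for_task": {"input_column": "sentence", "label_column": "label"},
    },
]

print(json.dumps(glue_suite_config, indent=2))
```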
After this core logic is nailed down, I'll move toward incorporating a larger set of tasks (e.g. the LM harness by EleutherAI, which requires support for multiple evaluators for one model). Also, qqp is actually a semantic similarity task and should be evaluated as such; we don't yet have the right evaluator implemented, though.