Evaluation suite #302

Closed · wants to merge 9 commits

Conversation

@mathemakitten (Contributor) commented Sep 26, 2022

Requires #301 to be merged first (lots of these diffs on data loading in the evaluator code will disappear afterward).

You can run the GLUE suite of tasks like this, for example:

from evaluate.evaluation_suite import load_evaluation_suite
evaluation_suite = load_evaluation_suite('mathemakitten/glue-suite')
results = evaluation_suite.run(model_or_pipeline='gpt2')

where the JSON file contains metadata on the dataset and args to pass to the evaluator. The example config for GLUE is here.
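
As a rough illustration (the field names below are hypothetical, not the actual glue-suite.json schema), one task entry in such a config could carry something like:

# Hypothetical sketch of one task entry in the suite config;
# field names are illustrative, not the real schema.
task_config = {
    "data": "glue",          # dataset on the Hub
    "subset": "cola",        # dataset configuration
    "split": "validation",   # split to evaluate on
    "args_for_task": {       # kwargs forwarded to the evaluator
        "task": "text-classification",
        "metric": "glue",
        "input_column": "sentence",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
}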

After this core logic is nailed down I'll move toward incorporating a larger set of tasks (e.g. the LM harness by EleutherAI, which requires support for multiple evaluators for one model). Also, qqp is actually a semantic similarity task and should be evaluated as such; we don't yet have the right evaluator implemented, though.

@HuggingFaceDocBuilderDev commented Sep 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@lvwerra (Member) left a comment

Hi @mathemakitten, this looks really good and clean! I left some comments and maybe @lhoestq also has some opinions

@mathemakitten (Contributor, Author) commented

Thanks @lvwerra, have added all those suggestions! If the API looks good to you now I'll add some tests assuming this format :)

@lvwerra requested a review from lhoestq on October 5, 2022
@lhoestq (Member) left a comment

For your information, there is already evaluation metadata in the dataset cards that contains the same info as your JSON file. See the train_eval_index content of glue for example: https://huggingface.co/datasets/glue/blob/main/README.md#L29, or even the imdb one, which also contains the metrics to use: https://huggingface.co/datasets/imdb/blob/main/README.md#L32

I think it would be nice if we could reuse train_eval_index so we don't end up with duplicate info scattered across the Hub. What do you think?
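
For reference, the train_eval_index entries in those cards look roughly like the following (rendered here as a Python dict rather than the card's YAML; field names are approximate rather than the exact spec):

# Approximate shape of one train_eval_index entry (e.g. for glue/cola);
# see the linked dataset cards for the authoritative format.
train_eval_index_entry = {
    "config": "cola",
    "task": "text-classification",
    "task_id": "binary_classification",
    "splits": {"train_split": "train", "eval_split": "validation"},
    "col_mapping": {"sentence": "text", "label": "target"},
    # the imdb card additionally lists the metrics to use, e.g.
    # "metrics": [{"type": "accuracy", "name": "Accuracy"}],
}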

@lvwerra (Member) commented Oct 5, 2022

@lhoestq maybe we can use the train_eval_index by default if nothing is specified, and have the option to overwrite it (if a train_eval_index is available) or provide one (if no train_eval_index is available). I think in some cases people might want a custom setup?

@lhoestq (Member) commented Oct 5, 2022

I see! Users could overwrite it with kwargs in load_harness. And to have a fully custom setup you can simply copy the dataset repository and modify the train_eval_index directly?

@mathemakitten changed the title from "Evaluation harness" to "Evaluation suite" on Oct 17, 2022
@mathemakitten (Contributor, Author) commented

Thanks for the pointer re: train_eval_index! Yes, we can definitely use the existing train_eval_index by default if nothing is specified and overwrite with the JSON config if needed. Will incorporate.

@lhoestq (Member) commented Oct 17, 2022

Cool! Overall I'd be in favor of not introducing new files or metadata if we can use something that exists already - this is less confusing for users.

@NimaBoscarino (Contributor) commented

Could there be a way to pass config_name down to a metric? Right now I think we only have access to metric as a string in the args_for_task, which means that AFAIK we can't pass a config_name or any other arguments for the metric.
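
For context, some metrics need a config name at load time, which a bare metric string can't express; a minimal illustration using the existing evaluate.load API (the glue/qqp choice is just an example):

import evaluate

# The GLUE metric needs to know which subtask it is scoring; evaluate.load
# takes that as its second argument (the config name).
qqp_metric = evaluate.load("glue", "qqp")

# A plain string like "glue" in args_for_task cannot carry that config name,
# which is the gap being pointed out above.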

@mathemakitten (Contributor, Author) commented

OK, I had a look at how train_eval_index is formulated and I'm having a bit of trouble reconciling how to get all the features we need out of it. After looking at a few datasets with train_eval_index I am still leaning toward having a separate EvaluationSuite config file, but maybe if the fields aren't specified, then we can pull them from train_eval_index? Either way I don't think we can avoid requiring new metadata in a config somewhere, but let me know what you think? Sorry in advance for the long read! 🤗

  • The upside of the evaluation suite should be that it allows you to compose arbitrary existing datasets from the Hub together for evaluation.
  • If I wanted to evaluate on just a subset of GLUE, is there a way to extract just those tasks without writing another config, custom code, or cloning the dataset repo? For example, if I wanted to evaluate on x tasks from GLUE as well as y tasks from SuperGLUE, I'd want to reuse the existing datasets for those instead of creating a whole new dataset, copying over the data, and specifying a custom train_eval_index (see the sketch after this list). It seems like we would have a lot of data duplication if everyone created new custom datasets (one copy for each combination of tasks we want to evaluate on). The upside of having an evaluation suite with its own config is that we can avoid copying the dataset repo and just reuse existing datasets on the Hub by composing them and using the Evaluator's data loading logic to load the subsets we need.
  • Since load_evaluation_suite should work with multiple different types of evaluators and it would get really messy quickly to have a big nested data structure full of kwargs for each evaluator passed around in the code, I think we should avoid passing custom kwargs to the load_evaluation_suite function. I tried to mirror the way that metric modules are loaded right now where ideally we'd like for someone to be able to simply upload a config file for the evaluation suite and the code should just load it by name without extra kwargs.
  • If we want to allow passing down config_name to metrics we would need to specify that somewhere as well; either in the train_eval_index or in a specific evaluation suite config file. Right now the attributes outlined in train_eval_index seem to all be attributes of the dataset itself. However, the evaluator also requires other kwargs which are not present in train_eval_index; e.g. "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0} — this is specific to a dataset-evaluator pair, and not an attribute of the dataset alone, so it wouldn't make sense to add it to the train_eval_index, right? If we don't want to modify the fields allowed in train_eval_index and they should remain dataset-specific, we need to pass around these specific kwargs somewhere (and see above for why I don't think it should be passed in load_evaluation_suite).
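
A rough sketch of the composition idea mentioned above, reusing existing Hub datasets directly instead of copying them into a new repo (the task choices are just examples):

from datasets import load_dataset

# Load only the GLUE / SuperGLUE subsets we care about, straight from the Hub;
# no new dataset repo and no duplicated data.
glue_subsets = ["cola", "sst2"]
superglue_subsets = ["boolq"]

eval_sets = {}
for subset in glue_subsets:
    eval_sets[("glue", subset)] = load_dataset("glue", subset, split="validation")
for subset in superglue_subsets:
    eval_sets[("super_glue", subset)] = load_dataset("super_glue", subset, split="validation")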

@lhoestq (Member) commented Oct 18, 2022

Oh ok I see! Let's use new config files then ^^' You can use a Space to store it and have a custom UI to present it. In a dataset repo you can only do markdown, and it would be weird to show a preview of a config file.

Note that enriching train_eval_index a bit is maybe possible, cc @lewtun

@mathemakitten (Contributor, Author) commented

OK, will do! That makes sense re: markdown and config, thanks @lhoestq :) In that case I'll keep the EvaluationSuite config. I appreciate you bringing up train_eval_index, though, because I didn't know it existed, and I generally agree with maximizing code reuse whenever possible. We should definitely consider extending train_eval_index to be usable with evaluate, though that might be a bit of a bigger undertaking.

@lvwerra (Member) commented Oct 24, 2022

Following up on last week's discussion, here are some ideas on how we could frame the SubTasks (placeholder name, I am sure there is a better one) of the EvaluationSuite as classes/dataclasses. In discussions it became clear that some datasets will need preprocessing before being passed to the evaluator, and the question is where to put that logic. Here I propose defining it in a Python script along with the SubTask definition.

In evaluate:

from dataclasses import dataclass
from typing import Callable, Optional, Union

@dataclass
class SubTask:
    # Model, Dataset, and Array are placeholder types in this sketch
    model_or_pipeline: Optional[Union[str, Model]] = None
    data: Optional[Union[str, Dataset, Array]] = None
    subset: Optional[str] = None
    split: Optional[str] = None
    data_preprocessor: Optional[Callable] = None

On the Hub or locally (lvwerra/sentiment_suite.py):

preprocessor = lambda x: x["text"].lower()

suite = [
    SubTask(
        data="imdb",
        split="test",
        data_preprocessor=preprocessor
    ),
    SubTask(
        data="sst2",
        split="validation",
        data_preprocessor=preprocessor
    )
]

User (notebook_17.ipynb):

from evaluate import Suite

suite = Suite.load("lvwerra/sentiment_suite")  # from the Hub
suite.run("lvwerra/distilbert-imdb")

We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!

@meg-huggingface (Contributor) commented

Thanks so much for this @lvwerra! Being able to pass in a preprocessor for the data is a great idea.
My main concern about this approach is that it looks like you would have to preprocess the same data separately for each task that uses the same dataset, rather than preprocessing the data once and then passing it in for each task. Is that right? If so, that's a lot of duplicated work for a time-consuming computation.

My thought was that it could work something like this.
Assume you have two tasks you want to run: the first might be called, for example, "multi_classification", for classifying content among multiple labels, and the second "binary_classification", for a basic binary model.

preprocessor = lambda x: x["text"].lower()

suite = [
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="imdb",
        split="test",
        data_preprocessor=preprocessor,
    ),
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="sst2",
        split="test",
        data_preprocessor=preprocessor,
    ),
]

Does that make sense? In this way, data would only need to be processed once for all the ways it can be used.

Regarding this bit:

We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!

I'm supportive of not using json, especially if not using it allows us to use a data_preprocessor. =)

@lhoestq (Member) commented Oct 24, 2022

Cool! The "EvaluationSuite" sounds clear to me, and so does how to define it, nice! ^^

Other name ideas for "SubTask" (since "task" is already used in the HF ecosystem for other things): maybe something like "EvaluationData" or "DataSource".

Could you also give more details on why there's "model_or_pipeline" in what you call "SubTask"?

@lvwerra (Member) commented Oct 25, 2022

  • Duplicate work: If we stick to datasets and use ds.map(data_preprocessor), the processing is cached by default. So if two tasks use the same dataset + configuration (via split and subset) + data_preprocessor, the result is loaded from the cache rather than recomputed (even at TB scale this only takes a few minutes); see the sketch after this list. Does that address your concern @meg-huggingface?

  • model_or_pipeline: Also not entirely sure, I copied it from @mathemakitten's implementation :) The classical evaluation benchmark would take this as input, I guess, and assume dataset and metric as given. Maybe the idea is to build a data evaluation benchmark with it too, where the input is a dataset and the suite defines the models and metrics to evaluate it with.

  • Naming: I agree, we should change it. Throwing another option into the hat: EvaluationJob
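
A minimal sketch of the caching behaviour mentioned in the first bullet (imdb and its text column are just an example):

from datasets import load_dataset

def lowercase_text(example):
    # map expects a dict of updated columns back
    return {"text": example["text"].lower()}

ds = load_dataset("imdb", split="test")
ds_a = ds.map(lowercase_text)  # computed once and written to the cache
ds_b = ds.map(lowercase_text)  # same dataset + same function -> reloaded from the cache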

@NimaBoscarino (Contributor) commented

Awesome! And the SubTask/Task/EvaluationJob would receive a metric parameter? Or something like args_for_task in the glue-suite.json?

@mathemakitten (Contributor, Author) commented

Thanks @lvwerra, this is super helpful for understanding your proposal!

model_or_pipeline was mirroring the arg which gets passed to evaluator.compute(), but we can move that out of the subtask and one level up in the suite definition since I think it won't vary at the subtask level. I'm also OK with scrapping JSON in favour of this dataclass implementation — JSON is easy to write but limiting if we want to include Python code (originally I wanted this to be as simple as writing JSON to define the entire suite, but it's becoming clear that won't be possible).

Question: where should data preprocessing live if it's not being used with an evaluation suite, but rather just as an extra preprocessing step before the data gets fed into evaluator.compute? Is the proposal in that case to override evaluator.prepare_data? It's not obvious to me how to tie these together so that it's clear to the user where data preprocessing should go; let me know what you had in mind there!

@meg-huggingface (Contributor) commented

  • Duplicate work: If we stick to datasets and use ds.map(data_preprocessor), the processing is cached by default.

Thanks. =) Yes, we discussed this earlier, and it's good by me.

I do want to clarify, though: the .map call is something we would have in place within our own code, operating on a data_preprocessor that a user passes in, and not something that everyone who writes a data_preprocessor would need to call themselves. The latter would be a bit hacky (we'd have to have instructions that say something like, "For your data preprocessing, please use the .map function..."), whereas if .map were built into our code and applied to user input, then no user has to worry about caching the right way.

@lvwerra (Member) commented Oct 31, 2022

Thanks for the feedback and your thoughts. Answering the open questions:

  • @NimaBoscarino: Oh indeed, the metric should be an additional attribute of the data class.
  • @mathemakitten: I think we can load the dataset in the evaluation suite part, apply the preprocessing function there, and pass the processed dataset to the evaluator (which may do additional processing internally), as sketched below. What do you think? I think that also answers your question, @meg-huggingface? So yes, the .map should be internal, and the user (or rather whoever defines the eval suite) just passes the preprocessing function.
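
A minimal sketch of that flow, assuming the existing text-classification evaluator; the preprocessing function returns a dict so it can be used with .map, and the model name is taken from the example above (its exact label names are an assumption):

from datasets import load_dataset
from evaluate import evaluator

# Suite-side: load the data and apply the suite-defined preprocessing once.
def preprocessor(example):
    return {"text": example["text"].lower()}

data = load_dataset("imdb", split="test").map(preprocessor)

# Then hand the processed dataset to the evaluator, which may do further
# task-specific preparation internally.
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric="accuracy",
    # label_mapping must match the labels the pipeline actually returns
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)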

@mathemakitten mentioned this pull request on Nov 1, 2022
@mathemakitten (Contributor, Author) commented

Superseded by #337!
