Evaluation suite #302

Closed · wants to merge 9 commits

Conversation

@mathemakitten (Contributor) commented Sep 26, 2022

Requires #301 to be merged first (lots of these diffs on data loading in the evaluator code will disappear afterward).

You can run the GLUE suite of tasks like this, for example:

from evaluate.evaluation_suite import load_evaluation_suite
evaluation_suite = load_evaluation_suite('mathemakitten/glue-suite')
results = evaluation_suite.run(model_or_pipeline='gpt2')

where the JSON file contains metadata on the dataset and args to pass to the evaluator. The example config for GLUE is here.
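
As a rough illustration (the field names below are hypothetical, not the actual glue-suite.json schema), one task entry in such a config could carry something like:

# Hypothetical sketch of one task entry in the suite config;
# field names are illustrative, not the real schema.
task_config = {
    "data": "glue",          # dataset on the Hub
    "subset": "cola",        # dataset configuration
    "split": "validation",   # split to evaluate on
    "args_for_task": {       # kwargs forwarded to the evaluator
        "task": "text-classification",
        "metric": "glue",
        "input_column": "sentence",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
}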

After this core logic is nailed down I'll move toward incorporating a larger set of tasks (e.g. the LM harness by EleutherAI, which requires support for multiple evaluators for one model). Also, qqp is actually a semantic similarity task and should be evaluated as such; we don't yet have the right evaluator implemented, though.

@HuggingFaceDocBuilderDev commented Sep 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@lvwerra (Member) left a comment

Hi @mathemakitten, this looks really good and clean! I left some comments and maybe @lhoestq also has some opinions

@mathemakitten (Contributor, Author) commented

Thanks @lvwerra, have added all those suggestions! If the API looks good to you now I'll add some tests assuming this format :)

@lvwerra requested a review from lhoestq on October 5, 2022
@lhoestq (Member) left a comment

For your information, there is already evaluation metadata in the dataset cards that contains the same info as your JSON file. See the train_eval_index content of glue for example: https://huggingface.co/datasets/glue/blob/main/README.md#L29, or even the imdb one, which also contains the metrics to use: https://huggingface.co/datasets/imdb/blob/main/README.md#L32

I think it would be nice if we could reuse train_eval_index so we don't end up with duplicate info scattered across the Hub. What do you think?
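
For reference, the train_eval_index entries in those cards look roughly like the following (rendered here as a Python dict rather than the card's YAML; field names are approximate rather than the exact spec):

# Approximate shape of one train_eval_index entry (e.g. for glue/cola);
# see the linked dataset cards for the authoritative format.
train_eval_index_entry = {
    "config": "cola",
    "task": "text-classification",
    "task_id": "binary_classification",
    "splits": {"train_split": "train", "eval_split": "validation"},
    "col_mapping": {"sentence": "text", "label": "target"},
    # the imdb card additionally lists the metrics to use, e.g.
    # "metrics": [{"type": "accuracy", "name": "Accuracy"}],
}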

@lvwerra (Member) commented Oct 5, 2022

@lhoestq maybe we can use the train_eval_index by default if nothing is specified, and have the option to overwrite it (if a train_eval_index is available) or provide one (if no train_eval_index is available). I think in some cases people might want a custom setup?

@lhoestq (Member) commented Oct 5, 2022

I see! Users could overwrite it with kwargs in load_harness. And to have a fully custom setup you can simply copy the dataset repository and modify the train_eval_index directly?

@mathemakitten changed the title from "Evaluation harness" to "Evaluation suite" on Oct 17, 2022
@mathemakitten (Contributor, Author) commented

Thanks for the pointer re: train_eval_index! Yes, we can definitely use the existing train_eval_index by default if nothing is specified and overwrite with the JSON config if needed. Will incorporate.

@lhoestq (Member) commented Oct 17, 2022

Cool! Overall I'd be in favor of not introducing new files or metadata if we can use something that exists already - this is less confusing for users.

@NimaBoscarino (Contributor) commented

Could there be a way to pass config_name down to a metric? Right now I think we only have access to metric as a string in the args_for_task, which means that AFAIK we can't pass a config_name or any other arguments for the metric.
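
For context, some metrics need a config name at load time, which a bare metric string can't express; a minimal illustration using the existing evaluate.load API (the glue/qqp choice is just an example):

import evaluate

# The GLUE metric needs to know which subtask it is scoring; evaluate.load
# takes that as its second argument (the config name).
qqp_metric = evaluate.load("glue", "qqp")

# A plain string like "glue" in args_for_task cannot carry that config name,
# which is the gap being pointed out above.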

@mathemakitten (Contributor, Author) commented

OK, I had a look at how train_eval_index is formulated and I'm having a bit of trouble reconciling how to get all the features we need out of it. After looking at a few datasets with train_eval_index I am still leaning toward having a separate EvaluationSuite config file, but maybe if the fields aren't specified, then we can pull them from train_eval_index? Either way I don't think we can avoid requiring new metadata in a config somewhere, but let me know what you think? Sorry in advance for the long read! 🤗

  • The upside of the evaluation suite should be that it allows you to compose arbitrary existing datasets from the Hub together for evaluation.
  • If I wanted to evaluate on just a subset of GLUE, is there a way to extract just those tasks without writing another config, custom code, or cloning the dataset repo? For example, if I wanted to evaluate on x tasks from GLUE as well as y tasks from SuperGLUE, I'd want to reuse the existing datasets for those instead of creating a whole new dataset, copying over the data, and specifying a custom train_eval_index (see the sketch after this list). It seems like we would have a lot of data duplication if everyone created new custom datasets (one copy for each combination of tasks we want to evaluate on). The upside of having an evaluation suite with its own config is that we can avoid copying the dataset repo and just reuse existing datasets on the Hub by composing them and using the Evaluator's data loading logic to load the subsets we need.
  • Since load_evaluation_suite should work with multiple different types of evaluators and it would get really messy quickly to have a big nested data structure full of kwargs for each evaluator passed around in the code, I think we should avoid passing custom kwargs to the load_evaluation_suite function. I tried to mirror the way that metric modules are loaded right now where ideally we'd like for someone to be able to simply upload a config file for the evaluation suite and the code should just load it by name without extra kwargs.
  • If we want to allow passing down config_name to metrics we would need to specify that somewhere as well; either in the train_eval_index or in a specific evaluation suite config file. Right now the attributes outlined in train_eval_index seem to all be attributes of the dataset itself. However, the evaluator also requires other kwargs which are not present in train_eval_index; e.g. "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0} — this is specific to a dataset-evaluator pair, and not an attribute of the dataset alone, so it wouldn't make sense to add it to the train_eval_index, right? If we don't want to modify the fields allowed in train_eval_index and they should remain dataset-specific, we need to pass around these specific kwargs somewhere (and see above for why I don't think it should be passed in load_evaluation_suite).
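
A rough sketch of the composition idea mentioned above, reusing existing Hub datasets directly instead of copying them into a new repo (the task choices are just examples):

from datasets import load_dataset

# Load only the GLUE / SuperGLUE subsets we care about, straight from the Hub;
# no new dataset repo and no duplicated data.
glue_subsets = ["cola", "sst2"]
superglue_subsets = ["boolq"]

eval_sets = {}
for subset in glue_subsets:
    eval_sets[("glue", subset)] = load_dataset("glue", subset, split="validation")
for subset in superglue_subsets:
    eval_sets[("super_glue", subset)] = load_dataset("super_glue", subset, split="validation")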

@lhoestq (Member) commented Oct 18, 2022

Oh ok I see! Let's use new config files then ^^' You can use a Space to store it and have a custom UI to present it. In a dataset repo you can only do markdown, and it would be weird to show a preview of a config file.

Note that enriching train_eval_index a bit is maybe possible, cc @lewtun

@mathemakitten (Contributor, Author) commented

OK, will do! That makes sense re: markdown and config, thanks @lhoestq :) In that case I'll keep the EvaluationSuite config. I appreciate you bringing up train_eval_index, though, because I didn't know it existed, and I generally agree with maximizing code reuse whenever possible. We should definitely consider extending train_eval_index to be usable with evaluate, though that might be a bit of a bigger undertaking.

@lvwerra (Member) commented Oct 24, 2022

Following up on last week's discussion, here are some ideas on how we could frame the SubTasks (placeholder name, I am sure there is a better one) of the EvaluationSuite as classes/dataclasses. In discussions it became clear that some datasets will need preprocessing before being passed to the evaluator, and the question is where to put that logic. Here I propose defining it in a Python script along with the SubTask definition.

In evaluate:

from dataclasses import dataclass
from typing import Callable, Optional, Union

@dataclass
class SubTask:
    # Model, Dataset, and Array are placeholder types in this sketch
    model_or_pipeline: Optional[Union[str, Model]] = None
    data: Optional[Union[str, Dataset, Array]] = None
    subset: Optional[str] = None
    split: Optional[str] = None
    data_preprocessor: Optional[Callable] = None

On the Hub or locally (lvwerra/sentiment_suite.py):

preprocessor = lambda x: x["text"].lower()

suite = [
    SubTask(
        data="imdb",
        split="test",
        data_preprocessor=preprocessor
    ),
    SubTask(
        data="sst2",
        split="validation",
        data_preprocessor=preprocessor
    )
]

User (notebook_17.ipynb):

from evaluate import Suite

suite = Suite.load("lvwerra/sentiment_suite")  # from the Hub
suite.run("lvwerra/distilbert-imdb")

We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!

@meg-huggingface (Contributor) commented

Thanks so much for this @lvwerra! Being able to pass in a preprocessor for the data is a great idea.
My main concern about this approach is that it looks like you would have to preprocess the same data separately for each task that uses the same dataset, rather than preprocessing the data once and then passing it in for each task. Is that right? If so, that's a lot of duplicated work for a time-consuming computation.

My thought was that it could work something like this.
Assume you have two tasks you want to run: the first might be called, for example, "multi_classification", for classifying content among multiple labels, and the second "binary_classification", for a basic binary model.

preprocessor = lambda x: x["text"].lower()

suite = [
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="imdb",
        split="test",
        data_preprocessor=preprocessor,
    ),
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="sst2",
        split="test",
        data_preprocessor=preprocessor,
    ),
]

Does that make sense? In this way, data would only need to be processed once for all the ways it can be used.

Regarding this bit:

We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!

I'm supportive of not using json, especially if not using it allows us to use a data_preprocessor. =)

@lhoestq (Member) commented Oct 24, 2022

Cool! The "EvaluationSuite" sounds clear to me, and so does how to define it, nice! ^^

Other name ideas for "SubTask" (since "task" is already used in the HF ecosystem for other things): maybe something like "EvaluationData" or "DataSource".

Could you also give more details on why there's "model_or_pipeline" in what you call "SubTask"?

@lvwerra (Member) commented Oct 25, 2022

  • Duplicate work: If we stick to datasets and use ds.map(data_preprocessor), the processing is cached by default. So if two tasks use the same dataset + configuration (via split and subset) + data_preprocessor, the result is loaded from the cache rather than recomputed (even at TB scale this only takes a few minutes); see the sketch after this list. Does that address your concern @meg-huggingface?

  • model_or_pipeline: Also not entirely sure, I copied it from @mathemakitten's implementation :) The classical evaluation benchmark would take this as input, I guess, and assume dataset and metric as given. Maybe the idea is to build a data evaluation benchmark with it too, where the input is a dataset and the suite defines the models and metrics to evaluate it with.

  • Naming: I agree, we should change it. Throwing another option into the hat: EvaluationJob
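
A minimal sketch of the caching behaviour mentioned in the first bullet (imdb and its text column are just an example):

from datasets import load_dataset

def lowercase_text(example):
    # map expects a dict of updated columns back
    return {"text": example["text"].lower()}

ds = load_dataset("imdb", split="test")
ds_a = ds.map(lowercase_text)  # computed once and written to the cache
ds_b = ds.map(lowercase_text)  # same dataset + same function -> reloaded from the cache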

@NimaBoscarino (Contributor) commented

Awesome! And the SubTask/Task/EvaluationJob would receive a metric parameter? Or something like args_for_task in the glue-suite.json?

@mathemakitten (Contributor, Author) commented

Thanks @lvwerra, this is super helpful for understanding your proposal!

model_or_pipeline was mirroring the arg which gets passed to evaluator.compute(), but we can move that out of the subtask and one level up in the suite definition since I think it won't vary at the subtask level. I'm also OK with scrapping JSON in favour of this dataclass implementation — JSON is easy to write but limiting if we want to include Python code (originally I wanted this to be as simple as writing JSON to define the entire suite, but it's becoming clear that won't be possible).

Question: where should data preprocessing live if it's not being used with an evaluation suite, but rather just as an extra preprocessing step before the data gets fed into evaluator.compute? Is the proposal in that case to override evaluator.prepare_data? It's not obvious to me how to tie these together so that it's clear to the user where data preprocessing should go; let me know what you had in mind there!

@meg-huggingface (Contributor) commented

  • Duplicate work: If we stick to datasets and use ds.map(data_preprocessor), the processing is cached by default.

Thanks. =) Yes, we discussed this earlier, and it's good by me.

I do want to clarify, though: the .map call is something we would have in place within our own code, operating on a data_preprocessor that a user passes in, and not something that everyone who writes a data_preprocessor would need to call themselves. The latter would be a bit hacky (we'd have to have instructions that say something like, "For your data preprocessing, please use the .map function..."), whereas if .map were built into our code and applied to user input, then no user has to worry about caching the right way.

@lvwerra (Member) commented Oct 31, 2022

Thanks for the feedback and your thoughts. Answering the open questions:

  • @NimaBoscarino: Oh indeed, the metric should be an additional attribute of the data class.
  • @mathemakitten: I think we can load the dataset in the evaluation suite part, apply the preprocessing function there, and pass the processed dataset to the evaluator (which may do additional processing internally), as sketched below. What do you think? I think that also answers your question, @meg-huggingface? So yes, the .map should be internal, and the user (or rather whoever defines the eval suite) just passes the preprocessing function.
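
A minimal sketch of that flow, assuming the existing text-classification evaluator; the preprocessing function returns a dict so it can be used with .map, and the model name is taken from the example above (its exact label names are an assumption):

from datasets import load_dataset
from evaluate import evaluator

# Suite-side: load the data and apply the suite-defined preprocessing once.
def preprocessor(example):
    return {"text": example["text"].lower()}

data = load_dataset("imdb", split="test").map(preprocessor)

# Then hand the processed dataset to the evaluator, which may do further
# task-specific preparation internally.
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric="accuracy",
    # label_mapping must match the labels the pipeline actually returns
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)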

@mathemakitten mentioned this pull request on Nov 1, 2022
@mathemakitten (Contributor, Author) commented

Superseded by #337!
