Refactor for loading multiple evaluation categories #38

@lvwerra

In addition to Metric, we also want to add other types of evaluations such as Comparison (#34) and Measurement (#35), following the internal discussion (https://huggingface.slack.com/archives/C035S5G2J3D/p1652200198598789). Technically, these all behave the same way: they take some inputs and compute scores. As such, they could largely be one class (essentially what Metric is today). That means we could also load them in the same fashion:

import evaluate

metric = evaluate.load("accuracy")
comparison = evaluate.load("mcnemar")
measure = evaluate.load("npmi")

While each type can live in a different folder in the repository, this can cause name clashes when the same name applies to two types (e.g. perplexity can be both a metric and a measurement). This could be solved with an additional argument, e.g. load("perplexity", type="metric"), that resolves those conflicts. I think this would be fine.

However, there is a second conflict with Spaces: since each metric (or comparison/measurement) would have its own Space with a widget, the conflicts are not so easy to resolve there unless we create an org for each type of evaluation, e.g. evaluate-metrics, evaluate-comparisons, evaluate-measurements. Each evaluation type would then be pushed to its own org.
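To make the per-org idea concrete, here is a minimal sketch of how a type could map to an org-qualified Space identifier. The org names are the ones suggested above; `ORG_BY_TYPE` and `space_id` are hypothetical names used only for illustration and are not part of the library.

```python
# Hypothetical mapping from evaluation type to Hub org,
# using the org names proposed in this issue.
ORG_BY_TYPE = {
    "metric": "evaluate-metrics",
    "comparison": "evaluate-comparisons",
    "measurement": "evaluate-measurements",
}


def space_id(name: str, type: str) -> str:
    """Build an '<org>/<name>' identifier for the given evaluation type.

    The parameter is called `type` to mirror the proposed load() argument.
    """
    if type not in ORG_BY_TYPE:
        raise ValueError(f"Unknown evaluation type: {type!r}")
    return f"{ORG_BY_TYPE[type]}/{name}"
```

With this scheme, perplexity as a metric and perplexity as a measurement would end up under different orgs and no longer clash.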

If that solution sounds good then we could implement the following behaviour:

  • if no type is provided, we cycle through metric/comparison/measurement in that order and return the first match
  • if a type is provided, we only look under that type and raise an error if no evaluation with that name exists for it
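The two rules above could be sketched roughly as follows. This is only an illustration: `AVAILABLE` is a hypothetical in-memory stand-in for whatever lookup the library would actually perform, `resolve_type` is a made-up helper name, and the choice of exception classes is an assumption.

```python
from typing import Optional

# Lookup order used when no type is given.
EVALUATION_TYPES = ("metric", "comparison", "measurement")

# Hypothetical stand-in for the real lookup (local folders or the Hub).
AVAILABLE = {
    "metric": {"accuracy", "perplexity"},
    "comparison": {"mcnemar"},
    "measurement": {"npmi", "perplexity"},
}


def resolve_type(name: str, type: Optional[str] = None) -> str:
    """Return the evaluation type under which `name` should be loaded."""
    if type is not None:
        # Explicit type: look only there and fail if the name is missing.
        if type not in EVALUATION_TYPES:
            raise ValueError(f"Unknown evaluation type: {type!r}")
        if name not in AVAILABLE[type]:
            raise FileNotFoundError(f"No {type} named {name!r}")
        return type
    # No type: cycle through the types in order and return the first match.
    for candidate in EVALUATION_TYPES:
        if name in AVAILABLE[candidate]:
            return candidate
    raise FileNotFoundError(f"No evaluation named {name!r}")
```

Under these assumptions, `resolve_type("perplexity")` picks the metric because metrics are checked first, while `resolve_type("perplexity", type="measurement")` selects the measurement explicitly.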

What do you think @douwekiela @lhoestq?
