Description
In addition to Metric, we also want to add other types of evaluations such as Comparison (#34) and Measurement (#35), following the internal discussion (https://huggingface.slack.com/archives/C035S5G2J3D/p1652200198598789). Technically, these all behave the same way: they take some inputs and compute a score. As such, they could largely be one class (essentially what Metric is today). That means we could also load them in the same fashion:
import evaluate
metric = evaluate.load("accuracy")
comparison = evaluate.load("mcnemar")
measure = evaluate.load("npmi")
While each type can live in a different folder in the repository, this can cause name clashes when the same name applies to two types (e.g. perplexity can be both a metric and a measurement). This could be solved with an additional argument, like load("perplexity", type="metric"), that resolves those conflicts. I think this would be fine.
However, there is a second conflict with Spaces: since each metric (or comparison/measurement) would have its own Space with a widget, the conflicts are not so easy to resolve here, unless we create an org for each type of evaluation, e.g. evaluate-metrics, evaluate-comparisons, evaluate-measurements. Then each evaluation type is pushed to a separate org.
If that solution sounds good then we could implement the following behaviour:
- if no type is provided, we cycle through metric/comparison/measurement and return the first result
- if a type is provided, we only look for that one and raise an error if nothing with that name exists for that type
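To make the two rules above concrete, here is a rough sketch of the lookup. It does not call the real library: the function name resolve, the ORGS mapping, and the AVAILABLE table (a stand-in for checking which Spaces exist on the Hub) are all hypothetical, and the org names are just the ones suggested above.

```python
from typing import Optional

# Hypothetical mapping from evaluation type to the org proposed above.
# Insertion order doubles as the lookup order when no type is given.
ORGS = {
    "metric": "evaluate-metrics",
    "comparison": "evaluate-comparisons",
    "measurement": "evaluate-measurements",
}

# Illustrative stand-in for "which modules exist in which org".
# A real implementation would query the Hub instead.
AVAILABLE = {
    "evaluate-metrics": {"accuracy", "perplexity"},
    "evaluate-comparisons": {"mcnemar"},
    "evaluate-measurements": {"npmi", "perplexity"},
}


def resolve(name: str, type: Optional[str] = None) -> str:
    """Return the "org/name" id that load(name, type=...) would pick."""
    # `type` shadows the builtin here only to mirror the proposed argument name.
    if type is None:
        # No type given: try metric, then comparison, then measurement.
        candidates = list(ORGS)
    elif type in ORGS:
        # Type given: look only in that org.
        candidates = [type]
    else:
        raise ValueError(f"Unknown evaluation type: {type!r}")

    for candidate in candidates:
        org = ORGS[candidate]
        if name in AVAILABLE[org]:
            return f"{org}/{name}"

    searched = "/".join(candidates)
    raise FileNotFoundError(f"No {searched} named {name!r} found")


print(resolve("perplexity"))                      # first match wins
print(resolve("perplexity", type="measurement"))  # explicit type resolves the clash
```

With this ordering an ambiguous name such as perplexity silently resolves to the metric, so anyone who wants the measurement has to pass the type explicitly.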
What do you think @douwekiela @lhoestq?