Often models are evaluated on multiple metrics in a project. For example, a classification project might always want to report Accuracy, Precision, Recall, and F1 score. In scikit-learn, the widely used classification report covers exactly this case. This proposal takes that a step further and allows the user to freely compose metrics: similar to a DatasetDict, one could use a MetricsSuite like a Metric object.
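For reference, this is roughly what the classification report gives in a single call; this is standard scikit-learn usage and only serves as the comparison point, not as part of the proposal:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# One call reports precision, recall, F1 and support per class, plus accuracy.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
predictions = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, predictions))

The proposed MetricsSuite would extend this one-call convenience to an arbitrary, user-chosen set of metrics: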
metrics_suite = MetricsSuite(
    {
        "accuracy": load_metric("accuracy"),
        "recall": load_metric("recall")
    }
)
Or, for a generation task:
metrics_suite = MetricsSuite(
    {
        "bleu": load_metric("bleu"),
        "rouge": load_metric("rouge"),
        "perplexity": load_metric("perplexity")
    }
)
metrics_suite.add(predictions, references)
metrics_suite.compute()
>>> {"bleu": bleu_result_dict, "rouge": roughe_result_dict, "perplexity": perplexity_result_dict}
Alternatively, we could also flatten the returned dict, or offer flattening as an option. We could also add a summary option that defines how an overall result is calculated, e.g. summary="average" to average all metrics into a single summary value, or a custom function such as summary=lambda x: x["bleu"]**2 + 0.5*x["rouge"] + 2. This would allow creating simple, composed metrics without needing to define a new metric (e.g. for a custom benchmark).
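One possible shape for the summary handling, purely as a sketch: it assumes each metric's result is a dict whose numeric values can be averaged and that a callable summary receives the full results mapping (the summarize name and its error handling are hypothetical):

def summarize(results, summary):
    # results: {"bleu": {...}, "rouge": {...}, ...} as returned by MetricsSuite.compute()
    if callable(summary):
        # Custom composition over the per-metric results, chosen by the user.
        return summary(results)
    if summary == "average":
        # Average every numeric value found across all metric result dicts.
        numeric = [v for result in results.values()
                   for v in result.values() if isinstance(v, (int, float))]
        return sum(numeric) / len(numeric)
    raise ValueError(f"Unknown summary option: {summary!r}")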