I'd like to add the Matthews correlation coefficient (MCC) for the multilabel case. There are essentially a few options:

* ("micro") flatten predictions and targets, then calculate a single MCC
* ("macro") calculate MCC per feature (label) and average the results
* Implement both micro and macro, as is done for the F1 score

Since scikit-learn doesn't support this out of the box, this may also be a terrible idea for some reason, in which case I'd like to learn why.
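
To make the micro and macro options concrete, here is a minimal pure-Python sketch. The names `binary_mcc` and `multilabel_mcc` and the `average` argument are hypothetical, chosen only for illustration, and returning 0.0 when the denominator is zero is an assumed convention, not something settled in this proposal.

```python
import math


def binary_mcc(y_true, y_pred):
    """MCC for two equal-length sequences of 0/1 values."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: a zero denominator (a constant column) is scored as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def multilabel_mcc(y_true, y_pred, average="macro"):
    """y_true, y_pred: lists of rows, each row a list of 0/1 per label."""
    if average == "micro":
        # Flatten every (sample, label) cell into one long binary problem.
        flat_true = [v for row in y_true for v in row]
        flat_pred = [v for row in y_pred for v in row]
        return binary_mcc(flat_true, flat_pred)
    if average == "macro":
        # Transpose rows into per-label columns, score each, then average.
        per_label = [
            binary_mcc(t_col, p_col)
            for t_col, p_col in zip(zip(*y_true), zip(*y_pred))
        ]
        return sum(per_label) / len(per_label)
    raise ValueError(f"unknown average: {average!r}")


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print("micro:", multilabel_mcc(y_true, y_pred, average="micro"))
print("macro:", multilabel_mcc(y_true, y_pred, average="macro"))
```

One thing the sketch makes visible: with macro averaging, any label whose targets or predictions are constant has an undefined MCC, so the chosen fallback value directly affects the average. Micro averaging only runs into this if the whole flattened array is constant.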