Title: Optimizing ROC curves using torch in R
Receiver Operating Characteristic (ROC) curves are useful for evaluating binary classification models, especially when data are unbalanced (97% negative, 3% positive, as in medical diagnosis, image segmentation, etc). We propose a new surrogate loss function called the AUM, which can be used to optimize ROC curves during gradient descent learning. Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show how the AUM loss can be easily implemented in torch code (using R or python), so the ROC curve optimization objective can be used during neural network training (in addition to its typical use for evaluation). In our empirical study of unbalanced binary classification problems, we show that our new AUM minimization learning algorithm results in improved AUC and speed relative to previous baselines.
Title: Two new algorithms for scientific applications of machine learning
Speaker: Toby Dylan HOCKING, https://tdhock.github.io/
Abstract: In the last few years, I have maintained active collaborations with scientists who are not machine learning experts, but who want to use machine learning algorithms for their data analyses. In many scientific applications of machine learning, two questions come up again and again.
- Question 1. we have some data from one region (or time period), so if we use these data to train, will it work on a new region? (or time period)
- Question 2. how do we deal with class imbalance?
- Example A: forestry. When predicting forest properties based on objects in satellite images, if we train on one region (say Arizona), will it work in another? (California or Quebec) How to deal with the fact that some objects of interest (trees, burn) are only a small minority of data?
- Example B: medicine. When predicting autism diagnosis from other survey responses, if we train on one year of survey data (say 2019), will it work in another year? (say 2020) And can we combine the two years of data to get a better model? How to deal with the fact that autism represents only 3% of the total surveys? (97% of survey respondants did not have autism)
- For Question 1, we propose a new algorithm called SOAK (Same/Other/All K-fold Cross-Validation), which can be used to quantify the extent to which it is possible to predict on a given data subset, after training on Same/Other/All data subsets. https://arxiv.org/abs/2410.08643
- For Question 2, we propose a new differentiable loss function which can be used to optimize the ROC curve, https://jmlr.org/papers/v24/21-0751.html
- I am a new professor at Université de Sherbrooke since June 2024, and I am open to collaborative research projects / co-supervising students.
Slides PDF HOCKING-two-new-algos-sci-ml-slides.pdf
Source files
- HOCKING-two-new-algos-sci-ml-slides.tex
- https://github.com/tdhock/max-generalized-auc
- https://github.com/tdhock/cv-same-other-paper with older slides showing different details.
Software
- mlr3resampling R package implements SOAK algorithm, tutorial.
- AUC and AUM in torch blog explaining how to implement AUM.
- aum R package implements directional derivatives and efficient line search.
Thanks to Valentina Boeva for the revision.
Transfer Learning and Imbalanced Classification in Scientific Machine Learning Speaker: Prof. Dr. Toby Dylan Hocking Host: Prof. Dr. Valentina Boeva When: May 7, 2025, 15:00. Room: H52, CAB, Universitatstrasse 6, 8006, Zurich
Abstract
Machine learning is increasingly being adopted in scientific fields by domain experts who are not machine learning specialists. Yet, recurring challenges arise in these collaborations:
- Generalization across domains: How well does a model trained on data from one region or time period transfer to another?
- Class imbalance: How can we effectively learn from data when the positive class is rare?
In this talk, I will present two new algorithms developed to address these challenges, motivated by real-world scientific applications:
- SOAK (Same/Other/All K-fold Cross-Validation): a novel cross-validation framework to quantify and diagnose the extent of domain generalization across data subsets (e.g., geographic regions, temporal slices).
- A differentiable ROC-based loss function: a new loss function designed to directly optimize ROC-AUC in the presence of severe class imbalance, compatible with deep learning architectures.
I will illustrate these methods through case studies in remote sensing (predicting forest attributes across regions from satellite imagery) and computational psychiatry (predicting autism diagnosis across survey years). The talk will discuss both the algorithmic ideas and their practical implications for machine learning in scientific settings.
Bio: Toby Dylan Hocking is an Associate Professor of Computer Science at Université de Sherbrooke and an Associate Academic Member at Mila – Quebec Artificial Intelligence Institute. His research focuses on developing machine learning algorithms and statistical software for scientific applications, with an emphasis on interpretable models, cross-validation, and imbalanced classification. He directs the LASSO research lab (Learning Algorithms, Statistical Software, Optimization). Toby earned his PhD in mathematics (machine learning) from École Normale Supérieure de Cachan (France) and has held research positions at Tokyo Tech, McGill University, and Northern Arizona University. He has authored dozens of R packages and published over 40 peer-reviewed papers in machine learning and statistical computing. He has mentored over 30 research students and 30 contributors in Google Summer of Code through the R Project.