GitHub - tdhock/two-new-algos-sci-ml

R abstract

Title: Optimizing ROC curves using torch in R

Receiver Operating Characteristic (ROC) curves are useful for evaluating binary classification models, especially when data are unbalanced (for example 97% negative and 3% positive, as in medical diagnosis, image segmentation, etc.). We propose a new surrogate loss function called the AUM, which can be used to optimize ROC curves during gradient descent learning. Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show how the AUM loss can be easily implemented in torch code (using R or Python), so the ROC curve optimization objective can be used during neural network training (in addition to its typical use for evaluation). In our empirical study of unbalanced binary classification problems, we show that our new AUM minimization learning algorithm results in improved AUC and speed relative to previous baselines.
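As a rough illustration of the "sort and sum" idea, the sketch below computes an AUM-like quantity in plain Python. It assumes the AUM is the area under the minimum of the false positive rate and the false negative rate, viewed as a function of a constant added to every predicted score. That definition, the function name `aum_sketch`, and the use of rates rather than counts are assumptions made for illustration; this is not the torch implementation referred to in the abstract, which would use differentiable tensor operations in place of the Python loop.

```python
def aum_sketch(scores, labels):
    """Area under min(FPR, FNR) as a function of a constant added to all scores.

    scores: real-valued predictions; labels: 1 for positive, 0 for negative.
    An example is predicted positive when score + constant > 0.
    (Hypothetical helper, written for illustration only.)
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Example i switches from predicted negative to predicted positive
    # once the added constant exceeds -score_i, so sort by that threshold.
    events = sorted((-s, y) for s, y in zip(scores, labels))
    false_pos = 0      # very negative constant: nothing is predicted positive
    false_neg = n_pos  # so every positive example is a false negative
    total = 0.0
    # The last event has no interval after it (false_neg is 0 from there on).
    for k in range(len(events) - 1):
        thresh, y = events[k]
        if y == 1:
            false_neg -= 1
        else:
            false_pos += 1
        # Both rates are constant until the next threshold; tied thresholds
        # give width 0 and contribute nothing.
        width = events[k + 1][0] - thresh
        total += min(false_pos / n_neg, false_neg / n_pos) * width
    return total


# Positives scored above negatives: the two rates are never both non-zero.
print(aum_sketch([2.0, 1.0, -1.0, -2.0], [1, 1, 0, 0]))  # 0.0
# One positive scored below one negative, 2 units apart.
print(aum_sketch([-1.0, 1.0], [1, 0]))  # 2.0
```

Under these assumptions the quantity is zero when every positive example is scored above every negative example, and it grows with the size of the score gap between mis-ordered examples, which is what makes it usable as a surrogate for ranking quality.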

Title, abstract, slides

Title: Two new algorithms for scientific applications of machine learning

Speaker: Toby Dylan HOCKING, https://tdhock.github.io/

Abstract: In the last few years, I have maintained active collaborations with scientists who are not machine learning experts, but who want to use machine learning algorithms for their data analyses. In many scientific applications of machine learning, two questions come up again and again.

Question 1. We have data from one region (or time period); if we train on these data, will the model work on a new region (or time period)?
Question 2. How do we deal with class imbalance?
Example A: forestry. When predicting forest properties based on objects in satellite images, if we train on one region (say Arizona), will the model work in another (California or Quebec)? How do we deal with the fact that some objects of interest (trees, burn) are only a small minority of the data?
Example B: medicine. When predicting autism diagnosis from other survey responses, if we train on one year of survey data (say 2019), will the model work in another year (say 2020)? Can we combine the two years of data to get a better model? How do we deal with the fact that autism represents only 3% of the total surveys (97% of survey respondents did not have autism)?
For Question 1, we propose a new algorithm called SOAK (Same/Other/All K-fold Cross-Validation), which can be used to quantify the extent to which it is possible to predict on a given data subset, after training on Same/Other/All data subsets. https://arxiv.org/abs/2410.08643
For Question 2, we propose a new differentiable loss function which can be used to optimize the ROC curve. https://jmlr.org/papers/v24/21-0751.html
I have been a professor at Université de Sherbrooke since June 2024, and I am open to collaborative research projects and co-supervising students.
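The Same/Other/All idea behind SOAK for Question 1 can be sketched in plain Python as follows. This is a minimal illustration and not the mlr3resampling implementation: the function name `soak_splits`, the deterministic round-robin fold assignment, and the choice to exclude the test fold's rows from every train set are assumptions made here for clarity.

```python
from collections import defaultdict


def soak_splits(subsets, n_folds=3):
    """Yield (test_subset, test_fold, train_name, train_idx, test_idx).

    subsets: one subset label per row (for example a region or a year).
    Hypothetical helper for illustration; fold assignment is round-robin
    within each subset, whereas a real analysis would likely randomize it.
    """
    # Assign each row a fold id, cycling 0..n_folds-1 within its subset.
    fold_of = []
    seen = defaultdict(int)
    for s in subsets:
        fold_of.append(seen[s] % n_folds)
        seen[s] += 1
    rows = range(len(subsets))
    for test_subset in sorted(set(subsets)):
        for fold in range(n_folds):
            # Test on the held-out fold of one subset only.
            test_idx = [i for i in rows
                        if subsets[i] == test_subset and fold_of[i] == fold]
            # Candidate train rows: everything outside the held-out fold.
            candidates = [i for i in rows if fold_of[i] != fold]
            same = [i for i in candidates if subsets[i] == test_subset]
            other = [i for i in candidates if subsets[i] != test_subset]
            train_sets = {"same": same, "other": other,
                          "all": sorted(same + other)}
            for train_name, train_idx in train_sets.items():
                yield test_subset, fold, train_name, train_idx, test_idx


# Usage: two survey years with six rows each.
years = ["2019"] * 6 + ["2020"] * 6
for test_subset, fold, name, train_idx, test_idx in soak_splits(years):
    print(test_subset, fold, name, len(train_idx), len(test_idx))
```

For each test subset and fold, one would fit a model on each of the three train sets and compare test error: if training on Other or All is as accurate as training on Same, that suggests the subsets are similar enough for a model to transfer.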

Slides PDF HOCKING-two-new-algos-sci-ml-slides.pdf

Source files

Software

mlr3resampling R package implements SOAK algorithm, tutorial.
AUC and AUM in torch blog explaining how to implement AUM.
aum R package implements directional derivatives and efficient line search.

Revised abstract

Thanks to Valentina Boeva for the revision.

Title: Transfer Learning and Imbalanced Classification in Scientific Machine Learning

Speaker: Prof. Dr. Toby Dylan Hocking

Host: Prof. Dr. Valentina Boeva

When: May 7, 2025, 15:00.

Room: H52, CAB, Universitatstrasse 6, 8006, Zurich

Abstract

Machine learning is increasingly being adopted in scientific fields by domain experts who are not machine learning specialists. Yet, recurring challenges arise in these collaborations:

Generalization across domains: How well does a model trained on data from one region or time period transfer to another?
Class imbalance: How can we effectively learn from data when the positive class is rare?

In this talk, I will present two new algorithms developed to address these challenges, motivated by real-world scientific applications:

SOAK (Same/Other/All K-fold Cross-Validation): a novel cross-validation framework to quantify and diagnose the extent of domain generalization across data subsets (e.g., geographic regions, temporal slices).
A differentiable ROC-based loss function: a new loss function designed to directly optimize ROC-AUC in the presence of severe class imbalance, compatible with deep learning architectures.
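To make the class imbalance concern concrete, the sketch below uses the 3% positive and 97% negative proportions mentioned elsewhere in this README. It is plain Python with hypothetical helper names (`accuracy`, `auc`), and it computes AUC the standard way, as the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as one half. It is only meant to show why accuracy can look good while ROC-AUC reveals that the model has learned nothing.

```python
def accuracy(scores, labels, threshold=0.0):
    """Fraction of examples classified correctly (positive if score > threshold)."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 3 positive and 97 negative labels; a model that always predicts negative.
labels = [1] * 3 + [0] * 97
constant_scores = [-1.0] * 100

print(accuracy(constant_scores, labels))  # 0.97
print(auc(constant_scores, labels))       # 0.5
```

The constant predictor is correct on 97 of 100 examples, yet its AUC equals that of random guessing, which is one reason to evaluate (and optimize) ROC-based criteria when the positive class is rare.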

I will illustrate these methods through case studies in remote sensing (predicting forest attributes across regions from satellite imagery) and computational psychiatry (predicting autism diagnosis across survey years). The talk will discuss both the algorithmic ideas and their practical implications for machine learning in scientific settings.

Bio: Toby Dylan Hocking is an Associate Professor of Computer Science at Université de Sherbrooke and an Associate Academic Member at Mila – Quebec Artificial Intelligence Institute. His research focuses on developing machine learning algorithms and statistical software for scientific applications, with an emphasis on interpretable models, cross-validation, and imbalanced classification. He directs the LASSO research lab (Learning Algorithms, Statistical Software, Optimization). Toby earned his PhD in mathematics (machine learning) from École Normale Supérieure de Cachan (France) and has held research positions at Tokyo Tech, McGill University, and Northern Arizona University. He has authored dozens of R packages and published over 40 peer-reviewed papers in machine learning and statistical computing. He has mentored over 30 research students and 30 contributors in Google Summer of Code through the R Project.
