See newer slides in my two-new-algos-sci-ml repo!
slides/ contains files for making presentation slides.
Title: Same versus Other Cross-Validation for comparing models trained on different groups of data
Abstract: Cross-validation is an essential algorithm in any machine learning analysis. Standard K-fold cross-validation is useful for comparing different algorithms on a single data set. We propose a new variant, Same versus Other cross-validation, which can be used to determine the extent to which accurate predictions are possible when training on a different data subset/group (person, image, geographic region, year, etc.). We discuss applications to several benchmark and real-world data sets, including predicting childhood autism, carbon emissions, and presence of objects in images.
See also https://github.com/tdhock/two-new-algos-sci-ml for other slides that include a subset of the figures.
And https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html which explains how to use the software.
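For example, here is a minimal sketch of the intended usage, on simulated data where the learnable pattern flips between two subsets. This assumes the ResamplingSameOtherSizesCV class and score() helper described in that vignette; exact column names such as train.subsets and test.subset may differ between versions.

library(data.table)
library(mlr3)
set.seed(1)
N <- 200
sim.dt <- data.table(
  group = rep(c("A", "B"), each = N / 2),  # two subsets of the data
  x = runif(N, -1, 1))
sim.dt[, y := factor(fifelse(group == "A", x > 0, x < 0))]  # pattern flips between groups
task <- TaskClassif$new("sim", sim.dt, target = "y")
task$col_roles$subset <- "group"  # values of this column define same/other subsets
task$col_roles$feature <- "x"
so.cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
bench.result <- benchmark(benchmark_grid(task, lrn("classif.rpart"), so.cv))
score.dt <- mlr3resampling::score(bench.result)
score.dt[, .(mean.error = mean(classif.ce)), by = .(test.subset, train.subsets)]

With these simulated data, training on the same subset should give low error, whereas training on the other subset should give high error, because the pattern is flipped.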
Textbook chapter with an explanation of cross-validation: Introduction to Machine Learning and Neural Networks.
Implementation tutorials:
- When is it useful to train with combined groups? (in R, most recommended) shows how to use my mlr3resampling package.
Older tutorials that show how to implement same/other cross-validation without my mlr3resampling package:
Tutorials about how to run machine learning experiments in parallel:
- Cross-validation experiments on the cluster (in Python)
- The importance of hyper-parameter tuning explains how to use the mlr3batchmark and batchtools R packages to parallelize machine learning experiments.
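For example, here is a minimal sketch of how those two packages fit together, assuming a typical batchtools registry setup (the tutorial's exact arguments may differ):

library(mlr3)
library(batchtools)
bench.grid <- benchmark_grid(
  tsk("spam"),
  lrns(c("classif.rpart", "classif.featureless")),
  rsmp("cv", folds = 3))
reg <- makeExperimentRegistry("registry_dir", seed = 1)
mlr3batchmark::batchmark(bench.grid, reg = reg)  # one batchtools job per resampling iteration
submitJobs(reg = reg)   # runs via whatever cluster functions the registry is configured with (e.g. SLURM)
waitForJobs(reg = reg)
bench.result <- mlr3batchmark::reduceResultsBatchmark(reg = reg)  # back to a BenchmarkResult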
Introductions to cluster computing on NAU monsoon:
conv_images.R computes results, then conv_images_10fold_figure.R makes a figure (in the repo) which shows there is a slight improvement (all-same) for the convolutional neural network, when the two subsets are the two digit image data sets.
conv_images_figure.R makes a figure comparing prediction error rates of five learning algorithms for the MNIST_EMNIST_rot data set, which has two subsets, both of which are images of digits that look alike. The figure shows that training on the other subset is not sufficient to get the same level of prediction error, even for a convolutional network (1.8% same versus 8.9% other when predicting MNIST, for example). These data suggest that learning is more difficult than we may expect, even with a convolutional neural network, which is assumed to have good generalization/learning ability. Actually, this figure is slightly misleading, because a ReLU activation was forgotten between the last two linear layers; this was fixed in the more recent figures, conv_images_10fold_*.
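For reference, here is a hypothetical R torch sketch of where that activation belongs (the actual architecture in conv_images.R may differ):

library(torch)
net <- nn_sequential(
  nn_conv2d(in_channels = 1, out_channels = 20, kernel_size = 5),
  nn_relu(),
  nn_max_pool2d(kernel_size = 2),
  nn_flatten(),
  nn_linear(20 * 12 * 12, 100),
  nn_relu(),  # the activation that was forgotten between the last two linear layers
  nn_linear(100, 10))
net(torch_randn(32, 1, 28, 28))$shape  # batch of 32 MNIST-sized images -> 32 x 10 scores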
data_Classif_simulation.R makes a figure of the simulated data (in the repo).
data-meta.R was updated to handle more groups; it outputs:
                  data.name memory.kb   rows n.groups small_group small_N large_group large_N features classes min.rows
                     <char>     <int>  <int>    <int>      <char>   <int>      <char>   <int>    <int>   <int>    <int>
 1:                   vowel        92    990        2        test     462       train     528       10      11       42
 2:                waveform       145    800        2       train     300        test     500       21       3       94
 3: CanadaFires_downSampled       353   1491        4         306     287         395     450       46       2      138
 4:                aztrees3       587   5956        3          NE    1464           S    2929       21       2       55
 5:                aztrees4       587   5956        4          SW     497          SE    2432       21       2       55
 6:         CanadaFires_all      1122   4827        4         306     364         326    2538       46       2      140
 7:                    spam      2078   4601        2        test    1536       train    3065       57       2      595
 8:                 zipUSPS     18752   9298        2        test    2007       train    7291      256      10      147
 9:             NSCH_autism     66242  46010        2        2019   18202        2020   27808      364       2      546
10:                  EMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
11:            FashionMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
12:                  KMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
13:                   MNIST    429712  70000        2        test   10000       train   60000      784      10      892
14:                  QMNIST    736548 120000        2        test   60000       train   60000      784      10     5421
15:                 CIFAR10   1441256  60000        2        test   10000       train   50000     3072      10     1000
16:                   STL10   2813121  13000        2       train    5000        test    8000    27648      10      500
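As a rough illustration, one row of the table above could be computed by a helper like the one below (hypothetical; it assumes min.rows means the smallest group-by-class count, and data-meta.R itself may differ):

library(data.table)
meta_row <- function(data.name, feature.dt, label.vec, group.vec){
  count.dt <- data.table(group = group.vec)[, .(N = .N), by = group][order(N)]
  data.table(
    data.name,
    memory.kb = as.integer(utils::object.size(feature.dt) / 1024),
    rows = nrow(feature.dt),
    n.groups = nrow(count.dt),
    small_group = count.dt[1, group], small_N = count.dt[1, N],
    large_group = count.dt[.N, group], large_N = count.dt[.N, N],
    features = ncol(feature.dt),
    classes = length(unique(label.vec)),
    min.rows = min(table(group.vec, label.vec)))
}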
data_Classif_canada_fires.R makes data_Classif/CanadaFires*csv and the following contingency table:
> canada.fires[, table(classe2, classe3)]
classe3
classe2 charred green other road scorched shadow water
bare 0 0 66 0 0 0 0
bog 0 0 41 0 0 0 0
charred 288 0 0 0 0 0 0
green 0 176 0 0 0 0 0
Lichen 0 0 47 0 0 0 0
lowgreen 0 124 0 0 0 0 0
Mortality 0 0 63 0 0 0 0
road 0 0 0 169 0 0 0
scorched 0 0 0 0 300 0 0
shadow(affected) 0 0 0 0 0 91 0
shadow(green) 0 0 0 0 0 81 0
water 0 0 0 0 0 0 217
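The classe2 to classe3 grouping implied by the table above can be written as a named mapping, sketched below (data_Classif_canada_fires.R may implement it differently):

classe3.map <- c(
  bare = "other", bog = "other", Lichen = "other", Mortality = "other",
  charred = "charred", green = "green", lowgreen = "green",
  road = "road", scorched = "scorched",
  "shadow(affected)" = "shadow", "shadow(green)" = "shadow",
  water = "water")
canada.fires[, classe3 := classe3.map[paste(classe2)]]  # paste() converts factor to character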
data_Classif_batchmark_registry.R reads the result of data_Classif_batchmark.R, writes data_Classif_batchmark_registry.csv, and creates the visualizations (in the repo).
data-meta.R creates data-meta.csv:
       data.name memory.kb test%   rows features classes min.rows.set.class
          <char>     <int> <int>  <int>    <int>   <int>              <int>
 1:        vowel        92    46    990       10      11                 42
 2:     waveform       145    62    800       21       3                 94
 3:         khan      2003    28     88     2308       4                  3
 4:         spam      2078    33   4601       57       2                595
 5:      zipUSPS     18752    21   9298      256      10                147
 6:     14cancer     22546    27    198    16063      14                  2
 7:       EMNIST    429712    14  70000      784      10               1000
 8: FashionMNIST    429712    14  70000      784      10               1000
 9:       KMNIST    429712    14  70000      784      10               1000
10:        MNIST    429712    14  70000      784      10                892
11:       QMNIST    736548    50 120000      784      10               5421
12:      CIFAR10   1441256    16  60000     3072      10               1000
13:        STL10   2813121    61  13000    27648      10                500
- Is the iid assumption verified in real data?
- train/test data sets
- mlbench? no explicit train/test column, see mlbench.R
- mlr3data https://mlr3data.mlr-org.com/ TODO
- caret https://topepo.github.io/caret/data-sets.html segmentationData has a Case column with values Train and Test (see the sketch after this list). TODO
- tidymodels https://modeldata.tidymodels.org/reference/index.html TODO
- ESL2 data processed in data_Classif_esl2.R
- list of image classification data sets: https://pytorch.org/vision/stable/datasets.html
- pages like https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST often have a split arg.
- https://github.com/pytorch/vision/tree/main/torchvision/datasets is source code.
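For example, the caret train/test indicator mentioned in the list above can be inspected as follows (assuming the caret package is installed):

data(segmentationData, package = "caret")
table(segmentationData$Case)  # counts of Test and Train rows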
Below we see the 11 torchvision data sets whose constructors have a train arg.
>>> import torch
>>> torch.__version__
'1.13.0+cpu'
>>> import torchvision.datasets
>>> torchvision.__version__
'0.14.0+cpu'
>>> for data_name in dir(torchvision.datasets):
...     data_class = getattr(torchvision.datasets, data_name)
...     ann_dict = getattr(data_class.__init__, "__annotations__", {})
...     if "train" in ann_dict:
...         print(data_name)
...
CIFAR10
CIFAR100
FashionMNIST
HMDB51
KMNIST
Kitti
MNIST
PhotoTour
QMNIST
UCF101
USPS
Newer versions of torchvision show the same data sets.
Why doesn't Caltech101/256 show up above? They have no split/train arg.
Why doesn't CELEBA show up? It does have a split arg, but no train arg, and the loop above only checks for train.
The split arg can be train/test/extra: https://pytorch.org/vision/stable/generated/torchvision.datasets.SVHN.html#torchvision.datasets.SVHN
Some data sets have both train and split args: https://pytorch.org/vision/stable/generated/torchvision.datasets.EMNIST.html#torchvision.datasets.EMNIST
LSUN has classes instead of split: https://pytorch.org/vision/stable/generated/torchvision.datasets.LSUN.html#torchvision.datasets.LSUN
Exceptions / not parsed correctly:
{'STL10': ({'unlabeled', 'test', 'train+unlabeled', 'train'}, " One of {'train', 'test', 'unlabeled', 'train+unlabeled'}.\n Accordingly, dataset is selected.\n")}
{'Cityscapes': (['fine', 'train', 'test', 'val', 'train', 'train_extra', 'val'], ' The image split to use, ``train``, ``test`` or ``val`` if mode="fine"\n otherwise ``train``, ``train_extra`` or ``val``\n')}
{'EMNIST': (['byclass', 'bymerge', 'balanced', 'letters', 'digits', 'mnist'], ' The dataset has 6 different splits: ``byclass``, ``bymerge``,\n ``balanced``, ``letters``, ``digits`` and ``mnist``. This argument specifies\n which one to use.\n')}
{'LFWPairs': (['train', 'test', '10fold', '10fold'], ' The image split to use. Can be one of ``train``, ``test``,\n ``10fold``. Defaults to ``10fold``.\n')}
{'MovingMNIST': (['train', 'test', 'None', 'split=None'], ' The dataset split, supports ``None`` (default), ``"train"`` and ``"test"``.\n If ``split=None``, the full data is returned.\n')}