Skip to content

tdhock/cv-same-other-paper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

96 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Slides

See newer slides in my two-new-algos-sci-ml repo!

slides/ contains files for making presentation slides.

Title: Same versus Other Cross-Validation for comparing models trained on different groups of data

Abstract: cross-validation is an essential algorithm in any machine learning analysis. Standard K-Fold cross-validation is useful for comparing different algorithms on a single data set. We propose a new variant, Same versus Other cross-validation, which can be used to determine the extent to which you can get accurate predictions, by training on some different data subset/group (person, image, geographic region, year, etc). We discuss applications to several benchmark and real-world data sets, including predicting childhood autism, carbon emissions, and presence of objects in images.

See also https://github.com/tdhock/two-new-algos-sci-ml for other slides that include a subset of the figures.

And https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html which explains how to use the software.

Textbook chapter with an explanation of cross-validation: Introduction to Machine Learning and Neural Networks.

Implementation tutorials:

Older tutorials that show how to implement same/other cross-validation without my mlr3resampling package:

Tutorials about how to run machine learning experiments in parallel:

Introductions to cluster computing on NAU monsoon:

27 Mar 2025

conv_images.R computes results then

conv_images_10fold_figure.R makes

conv_images_10fold_figures_same_other.png and

conv_images_10fold_figure_pval.png

which shows there is a slight improvement (all-same) for the convolutional neural network, when the two subsets are the two digit image data sets.

25 Mar 2025

conv_images_figure.R makes

conv_images_figures_same_other.png

The figure above compares prediction error rates of five learning algorithms for the MNIST_EMNIST_rot data set, which has two subsets, both of which are images of digits, which look alike. The figure shows that training on the other subset is not sufficient to get the same level of prediction error, even for a convolutional network (1.8% same versus 8.9% other when predicting MNIST for example). These data suggest that learning is more difficult that we may expect, even with a convolutional neural network which is assumed to have good generalization/learning— actually this figure is slightly misleading because a ReLU activation was forgotton between the last two linear layers— this was fixed in the more recent figures, conv_images_10fold_*.

4 Apr 2024

data_Classif_simulation.R makes

data_Classif_simulation_error_panels.png

data_Classif_simulation_scatter.png

6 Mar 2024

data-meta.R updated to handle more groups:

                  data.name memory.kb   rows n.groups small_group small_N large_group large_N features classes min.rows
                     <char>     <int>  <int>    <int>      <char>   <int>      <char>   <int>    <int>   <int>    <int>
 1:                   vowel        92    990        2        test     462       train     528       10      11       42
 2:                waveform       145    800        2       train     300        test     500       21       3       94
 3: CanadaFires_downSampled       353   1491        4         306     287         395     450       46       2      138
 4:                aztrees3       587   5956        3          NE    1464           S    2929       21       2       55
 5:                aztrees4       587   5956        4          SW     497          SE    2432       21       2       55
 6:         CanadaFires_all      1122   4827        4         306     364         326    2538       46       2      140
 7:                    spam      2078   4601        2        test    1536       train    3065       57       2      595
 8:                 zipUSPS     18752   9298        2        test    2007       train    7291      256      10      147
 9:             NSCH_autism     66242  46010        2        2019   18202        2020   27808      364       2      546
10:                  EMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
11:            FashionMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
12:                  KMNIST    429712  70000        2        test   10000       train   60000      784      10     1000
13:                   MNIST    429712  70000        2        test   10000       train   60000      784      10      892
14:                  QMNIST    736548 120000        2        test   60000       train   60000      784      10     5421
15:                 CIFAR10   1441256  60000        2        test   10000       train   50000     3072      10     1000
16:                   STL10   2813121  13000        2       train    5000        test    8000    27648      10      500

23 Feb 2024

data_Classif_canada_fires.R makes data_Classif/CanadaFires*csv and

> canada.fires[, table(classe2, classe3)]
                  classe3
classe2            charred green other road scorched shadow water
  bare                   0     0    66    0        0      0     0
  bog                    0     0    41    0        0      0     0
  charred              288     0     0    0        0      0     0
  green                  0   176     0    0        0      0     0
  Lichen                 0     0    47    0        0      0     0
  lowgreen               0   124     0    0        0      0     0
  Mortality              0     0    63    0        0      0     0
  road                   0     0     0  169        0      0     0
  scorched               0     0     0    0      300      0     0
  shadow(affected)       0     0     0    0        0     91     0
  shadow(green)          0     0     0    0        0     81     0
  water                  0     0     0    0        0      0   217

14 Feb 2024

data_Classif_batchmark_registry.R reads result of data_Classif_batchmark.R and writes data_Classif_batchmark_registry.csv and creates visualizations below,

data_Classif_batchmark_registry_glmnet_featureless.png

data_Classif_batchmark_registry_glmnet_median_quartiles.png

6 Feb 2024

data-meta.R creates data-meta.csv

       data.name memory.kb test%   rows features classes min.rows.set.class
          <char>     <int> <int>  <int>    <int>   <int>              <int>
 1:        vowel        92    46    990       10      11                 42
 2:     waveform       145    62    800       21       3                 94
 3:         khan      2003    28     88     2308       4                  3
 4:         spam      2078    33   4601       57       2                595
 5:      zipUSPS     18752    21   9298      256      10                147
 6:     14cancer     22546    27    198    16063      14                  2
 7:       EMNIST    429712    14  70000      784      10               1000
 8: FashionMNIST    429712    14  70000      784      10               1000
 9:       KMNIST    429712    14  70000      784      10               1000
10:        MNIST    429712    14  70000      784      10                892
11:       QMNIST    736548    50 120000      784      10               5421
12:      CIFAR10   1441256    16  60000     3072      10               1000
13:        STL10   2813121    61  13000    27648      10                500

Motivation

Below we see about 10 torchvision data sets with train arg.

>>> torch.__version__
'1.13.0+cpu'
>>> import torchvision.datasets
>>> torchvision.__version__
'0.14.0+cpu'
>>> for data_name in dir(torchvision.datasets):
...     data_class = getattr(torchvision.datasets, data_name)
...     ann_dict = getattr(data_class.__init__, "__annotations__", {})
...     if "train" in ann_dict:
...         print(data_name)
CIFAR10
CIFAR100
FashionMNIST
HMDB51
KMNIST
Kitti
MNIST
PhotoTour
QMNIST
UCF101
USPS

newer versions show the same data sets.

Why doesn’t Caltech101/256 show up above? no split/train arg.

Why doesn’t CELEBA show up? it does have split arg.

split arg can be train/test/extra https://pytorch.org/vision/stable/generated/torchvision.datasets.SVHN.html#torchvision.datasets.SVHN

Some have both train and split https://pytorch.org/vision/stable/generated/torchvision.datasets.EMNIST.html#torchvision.datasets.EMNIST

classes instead of split https://pytorch.org/vision/stable/generated/torchvision.datasets.LSUN.html#torchvision.datasets.LSUN

exceptions / not parsed correctly:

{'STL10': ({'unlabeled', 'test', 'train+unlabeled', 'train'}, " One of {'train', 'test', 'unlabeled', 'train+unlabeled'}.\n            Accordingly, dataset is selected.\n")}
{'Cityscapes': (['fine', 'train', 'test', 'val', 'train', 'train_extra', 'val'], ' The image split to use, ``train``, ``test`` or ``val`` if mode="fine"\n            otherwise ``train``, ``train_extra`` or ``val``\n')}
{'EMNIST': (['byclass', 'bymerge', 'balanced', 'letters', 'digits', 'mnist'], ' The dataset has 6 different splits: ``byclass``, ``bymerge``,\n            ``balanced``, ``letters``, ``digits`` and ``mnist``. This argument specifies\n            which one to use.\n')}
{'LFWPairs': (['train', 'test', '10fold', '10fold'], ' The image split to use. Can be one of ``train``, ``test``,\n            ``10fold``. Defaults to ``10fold``.\n')}
{'MovingMNIST': (['train', 'test', 'None', 'split=None'], ' The dataset split, supports ``None`` (default), ``"train"`` and ``"test"``.\n            If ``split=None``, the full data is returned.\n')}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published