ENH: Have prep create a directory with standardized format for each prepared dataset #650

@NickleDave

Description

Currently, running vak prep always generates a set of spectrograms plus a csv file representing a dataset.
This issue proposes that prep instead create a directory with a standardized format for each prepared dataset.

Drawbacks of current approach

There are a few drawbacks to the current approach:

  • moving files, e.g. to another computer, breaks all the paths in the csv, which are currently absolute paths
    • we could possibly fix this by writing them as relative paths, but then we need to capture a notion of the "root"; if we added another column for this, we'd repeat "root" needlessly, see next point
    • there are also multiple "semantics" for "paths" in the csv: for a spectrogram dataset, the 'audio_path' column tracks provenance: which original audio files did we generate the spectrograms from? (We also capture this info by reusing the audio filename and adding an extension, but that filename doesn't include the path back to the original file)
  • the tabular format of a csv file can't capture all the metadata we need about a dataset
    • e.g. we want to track the duration of a timebin, which we expect to be constant across all files, so it doesn't make sense to add it as a column to the csv
  • there are other things we should be tracking as part of a dataset that we are currently not
    • e.g., for each dataset split in a learncurve, we generate vectors that represent valid windows in a WindowDataset. This abstraction lets us "crop" the dataset to a specified duration, but those vectors are currently put in the results; this has led us to add a previous_run_path option so we can re-run multiple experiments with the same dataset. We should instead explicitly make these vectors part of the dataset (will raise a separate issue about making this change).
  • because the prepared dataset is not in a directory with a standardized format, it's not easy to save and move datasets, for the reasons above and also simply because the files end up wherever they are written, in output_dir or spect_output_dir, etc.
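As a minimal sketch of the relative-path fix discussed above (the function names here are hypothetical, not vak's actual API), storing paths relative to a single dataset root means moving the whole directory only changes the root you resolve against; the stored paths stay valid:

```python
# Hypothetical sketch: store csv paths relative to a dataset root so that
# moving the directory does not break them. Not vak's actual code.
from pathlib import Path

def to_relative(spect_path: str, dataset_root: str) -> str:
    """Rewrite an absolute path as a path relative to the dataset root."""
    return str(Path(spect_path).relative_to(dataset_root))

def resolve(relative_path: str, dataset_root: str) -> Path:
    """Resolve a stored relative path against wherever the dataset lives now."""
    return Path(dataset_root) / relative_path

rel = to_relative("/home/user/data/dataset/train/song1.wav.npz",
                  "/home/user/data/dataset")
assert rel == "train/song1.wav.npz"
# after moving the dataset to another machine, only the root changes:
assert resolve(rel, "/mnt/backup/dataset") == Path("/mnt/backup/dataset/train/song1.wav.npz")
```

Because the root is a single value, it belongs in dataset-level metadata (or is implied by the directory itself) rather than being repeated in a csv column.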

Advantages of the new approach

In addition to fixing the issues just described, having prep make each dataset as a directory with a standardized format has these advantages:

  • we can map Dataset classes onto this directory format
    • The main strength of having a Dataset class map to a directory is that we can then prepare built-in datasets ahead of time as directories, and make them available for download, e.g., as a .tar.gz archive that we then extract.
    • We can also change the format if/when required with less of an impact on a user.
      • for example, it's not clear to me right now whether there would be an advantage to moving to datasets that are all numpy arrays we can load in a memory-mapped way (like DAS does with Zarr arrays)
  • This also lets us better capture the notion of different kinds of datasets
    • e.g., for training UMAP models as in ENH: Add UMAP models and datasets #631, we will need a SegmentDataset. Again, if this lives in a directory with a standardized structure, it will be easier to reason about. We can also provide pre-generated datasets that follow Tim's notebooks, which people can download as .tar.gz files
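To illustrate the mapping of a Dataset class onto the directory format (the class name, method names, and meta.json keys below are hypothetical, not vak's actual API), a dataset could be constructed directly from such a directory:

```python
# Hypothetical sketch of a Dataset class that maps onto the proposed
# standardized directory, loading dataset-level metadata from meta.json.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PreparedDataset:
    root: Path
    timebin_dur: float  # constant across all files, so it lives in metadata

    @classmethod
    def from_dir(cls, dataset_dir):
        """Load a prepared dataset from its standardized directory."""
        root = Path(dataset_dir)
        meta = json.loads((root / "meta.json").read_text())
        return cls(root=root, timebin_dur=meta["timebin_dur"])

    def split_dir(self, split):
        """Directory holding the files for one split, e.g. 'train'."""
        return self.root / split
```

A downloaded .tar.gz archive, once extracted, would load through the exact same `from_dir` path as a locally prepared dataset, which is the point of standardizing the layout.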

Proposed dataset structure

An initial dataset format would look something like this:

dataset/
  train/
      song1.wav.npz
      song1.csv
      song2.wav.npz
      song2.csv
  val/
      song3.wav.npz
      song3.csv
  test/
      song4.wav.npz
      song4.csv
  dataset.csv
  # splits generated for learncurve
  traindur-30s-replicate-1.csv
  traindur-30s-replicate-1-source-id.npy
  traindur-30s-replicate-1-source-inds.npy
  traindur-30s-replicate-1-window-inds.npy
  config.toml  # config used to generate dataset
  prep.log  # log from run of prep
  meta.json  # any metadata
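A small validator (a hypothetical helper, assuming the top-level filenames and split directories shown in the tree above) makes the "standardized" part of the format checkable:

```python
# Hypothetical sketch: check that a prepared dataset directory contains
# the expected standardized layout. Not part of vak.
from pathlib import Path

REQUIRED_FILES = ["dataset.csv", "config.toml", "prep.log", "meta.json"]
REQUIRED_SPLITS = ["train", "val", "test"]

def missing_entries(dataset_dir):
    """Return names of required files or split dirs missing from a dataset."""
    root = Path(dataset_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [split for split in REQUIRED_SPLITS if not (root / split).is_dir()]
    return missing
```

Running such a check after extracting a downloaded archive, or before loading a dataset, would give users an immediate, specific error instead of a failure deep inside training.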

Deprecations

We will need to deprecate the spect_output_dir option.
