Currently, running `vak prep` always generates a set of spectrograms and a csv file representing a dataset. This issue proposes that `prep` instead create a directory with a standardized format.
Drawbacks of the current approach
There are a few drawbacks to the current approach:
- moving files, e.g. to another computer, breaks all the paths in the csv, which are currently absolute paths
  - we could possibly fix this by writing them as relative paths, but then we need to capture the notion of a "root"; if we added another column to do this, we would repeat the "root" needlessly (see next point). A sketch of resolving relative paths against a dataset root follows this list.
  - there are also multiple "semantics" for "paths" in the csv: for a spectrogram dataset, the `audio_path` column tracks provenance, i.e., which original audio files the spectrograms were generated from. (We also capture this information by reusing the audio filename and adding an extension, but that filename does not include the path back to the original file.)
- the tabular format of a csv file can't capture all the metadata we need about a dataset
  - e.g., we want to track the duration of a timebin, which we expect to be constant across all files, so it doesn't make sense to add it as a column to the csv
- there are other things we should be tracking as part of a dataset that we currently are not
  - e.g., for each dataset split in a learncurve, we generate vectors that represent valid windows in a `WindowDataset` -- this abstraction lets us "crop" the dataset to a specified duration -- but those vectors are put in the results; this has led us to add a `previous_run_path` option so we can re-run multiple experiments with the same dataset. We should instead explicitly make these vectors part of the dataset (will raise a separate issue about making this change).
- because the prepared dataset is not in a directory with a standardized format, it's not easy to save and move datasets, for the reasons above and also simply because the files end up wherever they are, in `output_dir` or `spect_output_dir`, etc.
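To make the points about paths concrete, here is a minimal sketch of writing path columns relative to a dataset root and resolving them at load time, so that moving the whole directory does not break them. The function name, the `dataset.csv` filename, and the `spect_path`/`audio_path` column names are illustrative assumptions, not necessarily the existing vak API:

```python
# Minimal sketch, assuming path columns in the csv are stored relative to the
# dataset directory ("root"); moving the whole directory then keeps the csv valid.
import pathlib

import pandas as pd


def load_dataset_csv(dataset_dir):
    """Load the dataset csv and resolve relative path columns against the
    dataset directory, returning a DataFrame with absolute paths."""
    dataset_dir = pathlib.Path(dataset_dir)
    df = pd.read_csv(dataset_dir / "dataset.csv")
    for col in ("spect_path", "audio_path"):  # assumed column names
        if col in df.columns:
            df[col] = df[col].map(lambda path: str((dataset_dir / path).resolve()))
    return df
```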
Advantages of the new approach
Beyond fixing the issues just described, having `prep` make datasets as a directory with a standardized format has further advantages:
- we can map `Dataset` classes onto this directory format
  - the main strength of having a `Dataset` class map to a directory is that we can prepare built-in datasets ahead of time as directories, then distribute them for download, e.g. as a .tar.gz archive we extract (a rough sketch of this follows this list)
  - we can also change the format if/when required with less of an impact on users
    - for example, it's not clear to me right now whether there would be an advantage to moving to datasets that are all numpy arrays we can load in a memory-mapped way (like DAS does with Zarr arrays)
- this also lets us better capture the notion of different kinds of datasets
  - e.g., for training UMAP models as in ENH: Add UMAP models and datasets #631, we will need a `SegmentDataset`. Again, if this lives in a directory with a standardized structure, it will be easier to reason about. We can provide pre-generated datasets following Tim's notebooks that people can download as .tar.gz files
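As a rough illustration of the "prepare ahead of time, then download and extract" idea, something like the sketch below could work. The helper name, the URL handling, and the assumption that the archive unpacks into a single top-level directory are all hypothetical:

```python
# Minimal sketch, assuming pre-generated datasets are hosted as .tar.gz archives
# that unpack into a single top-level dataset directory.
import pathlib
import tarfile
import urllib.request


def fetch_dataset(url, dst_dir):
    """Download a .tar.gz dataset archive and extract it into ``dst_dir``,
    returning the path to the extracted dataset directory."""
    dst_dir = pathlib.Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dst_dir / pathlib.Path(url).name
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dst_dir)
    # hypothetical convention: 'some-dataset.tar.gz' unpacks to 'some-dataset/'
    return dst_dir / archive_path.name.removesuffix(".tar.gz")
```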
Proposed dataset structure
An initial dataset format would look something like this:

```
dataset/
    train/
        song1.wav.npz
        song1.csv
        song2.wav.npz
        song2.csv
    val/
        song3.wav.npz
        song3.csv
    test/
        song4.wav.npz
        song4.csv
    dataset.csv
    # splits generated for learncurve
    traindur-30s-replicate-1.csv
    traindur-30s-replicate-1-source-id.npy
    traindur-30s-replicate-1-source-inds.npy
    traindur-30s-replicate-1-window-inds.npy
    config.toml  # config used to generate dataset
    prep.log  # log from run of prep
    meta.json  # any metadata
```
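For `meta.json`, a minimal sketch of the kind of dataset-level metadata it could capture (e.g. the timebin duration discussed above) is shown below; the field names and schema here are assumptions to be settled during implementation:

```python
# Minimal sketch of writing/reading meta.json at the root of a dataset directory.
# The fields shown are assumed examples of dataset-level metadata, not a fixed schema.
import dataclasses
import json
import pathlib


@dataclasses.dataclass
class Metadata:
    dataset_csv_filename: str  # e.g. 'dataset.csv' at the dataset root
    timebin_dur: float         # duration of a spectrogram time bin, in seconds
    audio_format: str          # e.g. 'wav'

    def to_json(self, dataset_dir):
        with (pathlib.Path(dataset_dir) / "meta.json").open("w") as fp:
            json.dump(dataclasses.asdict(self), fp, indent=4)

    @classmethod
    def from_json(cls, dataset_dir):
        with (pathlib.Path(dataset_dir) / "meta.json").open("r") as fp:
            return cls(**json.load(fp))


# hypothetical usage:
# Metadata("dataset.csv", timebin_dur=0.002, audio_format="wav").to_json("dataset/")
```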
Deprecations
We will need to deprecate the `spect_output_dir` option.