Estimate Data Weights Requires Redundant Metadata, Needs to be removed #13279

@bonham79

Description

Is your feature request related to a problem? Please describe.

The estimate data weights script works off input_cfg YAML files, but its call to the parser assumes that fields which are normally auto-populated already exist. As a result, every cutset passed to lhotse has to specify information such as:

  • lang_field
  • text_field
  • shard_seed
  • shuffle

Requiring the first two per cutset is fine, but passing the last two for every manifest collection is redundant: that information should be managed only through the training dataset config.

Describe the solution you'd like

Add default options to nemo_tarred and similar cutsets that autofill this information, so that calling the script does not raise an error. Alternatively, have the estimation script apply autofill defaults itself (this is less likely, since you want to keep it agnostic to the type of cutset used).
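
To make the request concrete, here is a rough sketch of the kind of autofill I have in mind. It is not NeMo code: fill_defaults, ASSUMED_DEFAULTS, the default values, and the file paths are all hypothetical, and the sketch assumes input_cfg is a list of dict entries in which a group entry nests its own input_cfg list.

```python
# Hypothetical defaults for the two redundant fields. These values are
# placeholders for illustration, not the defaults NeMo actually uses.
ASSUMED_DEFAULTS = {"shard_seed": 0, "shuffle": False}


def fill_defaults(input_cfg, defaults=None):
    """Return a copy of input_cfg with missing default keys filled in.

    Keys that an entry already sets are left alone, so anything given
    explicitly in the config still wins over the defaults.
    """
    defaults = ASSUMED_DEFAULTS if defaults is None else defaults
    filled = []
    for entry in input_cfg:
        entry = dict(entry)  # shallow copy so the caller's config is untouched
        if "input_cfg" in entry:
            # Assumed layout: a group entry holds a nested list of entries.
            entry["input_cfg"] = fill_defaults(entry["input_cfg"], defaults)
        else:
            for key, value in defaults.items():
                entry.setdefault(key, value)
        filled.append(entry)
    return filled


# Example config with made-up paths. Only lang_field and text_field are
# given per cutset; shard_seed and shuffle are filled in automatically.
example_cfg = [
    {
        "type": "nemo_tarred",
        "manifest_filepath": "/data/en/manifest.json",
        "lang_field": "lang",
        "text_field": "text",
    },
    {
        "type": "group",
        "input_cfg": [
            {
                "type": "nemo_tarred",
                "manifest_filepath": "/data/de/manifest.json",
                "shuffle": True,  # explicit value is preserved
            },
        ],
    },
]

filled_cfg = fill_defaults(example_cfg)
```

With something like this applied before the parser is called, the per-cutset entries only need the fields that genuinely vary, and the script stays agnostic to the cutset type because it only adds keys that are missing.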

Additional context

Current workaround

[Image: screenshot of the current workaround]
