
h100: Worse output & 20x slower inference? #89

@addytheyoung

Description


We're testing fine-tuning on an H100 and a 4090; here are the results:

4090: https://voca.ro/11mtxzLHzzih
h100: https://voca.ro/15QldVjuG7nu

The fine-tunes are almost identical, but the H100's output is SIGNIFICANTLY worse. It isn't a config issue, and we've replicated it twice with LJSpeech as well.

The 4090 is also faster during training and considerably faster during inference, almost 20x faster than the H100:

[Screenshot: 4090 inference timing (2023-11-26)]

H100:

[Screenshot: H100 inference timing (2023-11-24)]

And during training, one epoch took the 4090 about 3 minutes, while the H100 took about 4.1 minutes.
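When comparing GPU timings like these, it's worth ruling out the measurement itself: CUDA kernel launches are asynchronous, so wall-clock timing without a device synchronize can make either card look arbitrarily fast or slow. A minimal, framework-agnostic sketch (the `fn` callable and the optional `sync` hook are illustrative assumptions, not anything from this repo):

```python
import time

def time_call(fn, warmup=3, iters=10, sync=None):
    """Average wall-clock seconds per call of `fn`.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) that
    blocks until all queued GPU work has finished; without it, async
    kernel launches make GPU timings meaningless.
    """
    for _ in range(warmup):  # warm-up runs pay one-time costs (JIT, autotune)
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync:
        sync()
    return (time.perf_counter() - start) / iters
```

With PyTorch you would pass `sync=torch.cuda.synchronize`. The untimed warm-up iterations matter because the first pass can pay one-time costs (cuDNN autotuning, PTX JIT compilation) that dwarf steady-state speed, especially on a new architecture.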

Does anyone know what could be going on here? I've never seen an issue like this on an H100 before with a diffusion-like model. Thanks!
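For what it's worth, a newer GPU being both slower and lower-quality often traces back to the software stack rather than the card: a PyTorch wheel built against an older CUDA has no sm_90 kernels for Hopper and can fall back to slow JIT-compiled paths, and TF32/precision defaults may differ between the two machines. A hedged diagnostic sketch, assuming a PyTorch environment (the function name `gpu_env_report` is made up; it degrades gracefully if torch or a GPU is absent):

```python
import platform

def gpu_env_report():
    """Collect version/capability info relevant to H100 slowdowns."""
    report = {"python": platform.python_version()}
    try:
        import torch
        report["torch"] = torch.__version__
        report["cuda_build"] = torch.version.cuda  # CUDA the wheel was built with
        if torch.cuda.is_available():
            report["device"] = torch.cuda.get_device_name(0)
            # H100 is compute capability (9, 0); builds that predate sm_90
            # support hit slow fallback paths on Hopper.
            report["capability"] = torch.cuda.get_device_capability(0)
            report["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    except ImportError:
        report["torch"] = None  # torch not installed in this environment
    return report

if __name__ == "__main__":
    for key, value in gpu_env_report().items():
        print(f"{key}: {value}")
```

Comparing this report on the 4090 and H100 boxes would quickly show whether the two runs are even on the same PyTorch/CUDA versions and precision settings.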


Labels: help wanted
