
h100: Worse output & 20x slower inference? #89

@addytheyoung

Description


We're testing fine-tuning on an H100 and a 4090; here are the results:

4090: https://voca.ro/11mtxzLHzzih
h100: https://voca.ro/15QldVjuG7nu

The fine-tunes are almost identical, but the H100's output is SIGNIFICANTLY worse. It isn't a config issue, and we've replicated it twice with LJSpeech as well.

The 4090 is also faster during training and considerably faster during inference, almost 20x faster than the H100:

[Screenshot: 4090 inference timing (2023-11-26)]

H100:

[Screenshot: H100 inference timing (2023-11-24)]

And during training, one epoch took the 4090 about 3 minutes, while the H100 took about 4.1 minutes.
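When comparing GPU timings like these, it's worth ruling out the measurement itself: CUDA kernel launches are asynchronous, so wall-clock timing without a device synchronize can make either card look arbitrarily fast or slow. A minimal, framework-agnostic sketch (the `fn` callable and the optional `sync` hook are illustrative assumptions, not anything from this repo):

```python
import time

def time_call(fn, warmup=3, iters=10, sync=None):
    """Average wall-clock seconds per call of `fn`.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) that
    blocks until all queued GPU work has finished; without it, async
    kernel launches make GPU timings meaningless.
    """
    for _ in range(warmup):  # warm-up runs pay one-time costs (JIT, autotune)
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync:
        sync()
    return (time.perf_counter() - start) / iters
```

With PyTorch you would pass `sync=torch.cuda.synchronize`. The untimed warm-up iterations matter because the first pass can pay one-time costs (cuDNN autotuning, PTX JIT compilation) that dwarf steady-state speed, especially on a new architecture.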

Does anyone know what could be going on here? I've never seen an issue like this on an H100 before with a diffusion-like model. Thanks!
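For what it's worth, a newer GPU being both slower and lower-quality often traces back to the software stack rather than the card: a PyTorch wheel built against an older CUDA has no sm_90 kernels for Hopper and can fall back to slow JIT-compiled paths, and TF32/precision defaults may differ between the two machines. A hedged diagnostic sketch, assuming a PyTorch environment (the function name `gpu_env_report` is made up; it degrades gracefully if torch or a GPU is absent):

```python
import platform

def gpu_env_report():
    """Collect version/capability info relevant to H100 slowdowns."""
    report = {"python": platform.python_version()}
    try:
        import torch
        report["torch"] = torch.__version__
        report["cuda_build"] = torch.version.cuda  # CUDA the wheel was built with
        if torch.cuda.is_available():
            report["device"] = torch.cuda.get_device_name(0)
            # H100 is compute capability (9, 0); builds that predate sm_90
            # support hit slow fallback paths on Hopper.
            report["capability"] = torch.cuda.get_device_capability(0)
            report["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    except ImportError:
        report["torch"] = None  # torch not installed in this environment
    return report

if __name__ == "__main__":
    for key, value in gpu_env_report().items():
        print(f"{key}: {value}")
```

Comparing this report on the 4090 and H100 boxes would quickly show whether the two runs are even on the same PyTorch/CUDA versions and precision settings.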


Labels: help wanted
