Open
Labels
help wanted (Extra attention is needed)
Description
We're testing finetuning on an h100 and a 4090. Here are the results:
4090: https://voca.ro/11mtxzLHzzih
h100: https://voca.ro/15QldVjuG7nu
The finetunes are almost identical, but the h100 output is SIGNIFICANTLY worse. It isn't a config issue, and we've replicated it twice with LJSpeech as well.
The 4090 is also faster during training and considerably faster during inference, almost 20x faster than the h100.
And during training, one epoch took the 4090 about 3 minutes, while the h100 took 4.12 minutes.
Does anyone know what could be going on here? I've never seen an issue like this on an h100 before with a diffusion-like model. Thanks
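
For anyone trying to reproduce the timing gap, here is a minimal sketch of a timing helper that could be run unchanged on both cards. It is not the script used for the numbers above. The names `benchmark` and `slowdown` are made up for illustration, and the epoch comparison assumes "4.12 minutes" means decimal minutes rather than 4 min 12 s.

```python
import time
from statistics import median


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until the device
    is idle. GPU kernels are launched asynchronously, so timing without a
    synchronization point can measure launch overhead rather than real work.
    """
    def timed_call():
        if sync is not None:
            sync()  # drain pending work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return time.perf_counter() - start

    # Warm-up runs absorb one-time costs (allocation, kernel selection, caches).
    for _ in range(warmup):
        timed_call()
    return median(timed_call() for _ in range(iters))


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


# Epoch times reported above, assuming 4.12 means decimal minutes.
epoch_ratio = slowdown(4.12 * 60, 3 * 60)
print(f"h100 epoch time is {epoch_ratio:.2f}x the 4090's")
```

With PyTorch, the helper would be called as something like `benchmark(run_inference, sync=torch.cuda.synchronize)`, where `run_inference` is a placeholder for whatever generates one utterance. Using the median rather than the mean keeps a single slow iteration from skewing the comparison.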