
Conversation

@kmn1024 (Contributor) commented Nov 24, 2023

SLM joint training bug in finetuning code: #72 (comment)

@yl4579 (Owner) commented Nov 24, 2023

It’s probably related to this: #15
I couldn’t reproduce it because my PyTorch version doesn’t have this problem. Does the order of backward matter, though? Do you also have to change the order of the generator loss backward?
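For context, the order-of-backward issue being discussed is characteristic of the in-place modification error that PyTorch 1.5 and later raise when an optimizer step lands between a forward pass and a later backward pass that still needs the stepped parameters. A minimal sketch with hypothetical stand-in modules and losses (not the actual StyleTTS2 code):

import torch

G = torch.nn.Linear(4, 4)   # stand-in for the generator
D = torch.nn.Linear(4, 1)   # stand-in for the SLM discriminator
opt_d = torch.optim.SGD(D.parameters(), lr=0.01)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01)

fake = G(torch.randn(8, 4))
loss_d = D(fake.detach()).mean()   # stand-in discriminator loss
loss_g = -D(fake).mean()           # stand-in adversarial generator loss

# Failing order on PyTorch >= 1.5: opt_d.step() updates D's weights in
# place, invalidating the tensors saved for loss_g's backward pass:
#   loss_d.backward(); opt_d.step(); loss_g.backward()
#   -> RuntimeError: one of the variables needed for gradient computation
#      has been modified by an inplace operation

# Working order: run every backward pass before any optimizer step.
# (Note loss_g also deposits gradients into D here; managing that
# accumulation is what the follow-up comments below are about.)
loss_d.backward()
loss_g.backward()
opt_d.step()
opt_g.step()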

@kmn1024 (Contributor, Author) commented Nov 24, 2023

Sorry, you are correct! The order of the backward calls also needs to be changed. The run now works on my setup.

For completeness, this is my setup:

> python -c "import torch; print(torch.version.cuda)"
12.1
> python -c "import torch; print(torch.__version__)"
2.1.1
> nvidia-smi
Fri Nov 24 04:49:56 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:05:00.0 Off |                  Off |
| 30%   51C    P2             118W / 300W |  32545MiB / 49140MiB |     28%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:45:00.0 Off |                  Off |
| 30%   53C    P2             111W / 300W |  27637MiB / 49140MiB |     30%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:85:00.0 Off |                  Off |
| 30%   55C    P2             113W / 300W |  27617MiB / 49140MiB |     22%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:C5:00.0 Off |                  Off |
| 30%   52C    P2             107W / 300W |  27529MiB / 49140MiB |     26%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

yl4579 merged commit 23c16b7 into yl4579:main Nov 24, 2023
@yl4579 (Owner) commented Nov 24, 2023

I actually found a bug in this fix. It calls optimizer.zero_grad() twice when the discriminator loss isn’t 0, so it implicitly overwrites the gradients from that iteration and optimizes against the generator loss. I think we have to move the optimizer step lines before the discriminator loss line as well.
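For reference, a minimal sketch of the shape being described, with hypothetical stand-in names rather than the actual training-loop variables: a second zero_grad() in the same iteration discards whatever gradients were accumulated before it, so the final step() only reflects the backward pass that ran after the second call.

import torch

model = torch.nn.Linear(4, 1)   # stand-in for the trained modules
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 4)

def gen_loss():        # stand-in generator loss
    return model(x).mean()

def slm_disc_loss():   # stand-in SLM discriminator loss
    return (model(x) ** 2).mean()

# Buggy shape: the second zero_grad() silently discards the gradients
# accumulated by the first backward(), so step() applies only the
# gradients computed after it.
optimizer.zero_grad()
gen_loss().backward()
d_loss = slm_disc_loss()
if d_loss != 0:
    optimizer.zero_grad()       # wipes the gradients accumulated above
    d_loss.backward()
optimizer.step()

# Proposed shape: step before the discriminator branch so each
# zero_grad()/backward()/step() cycle is self-contained.
optimizer.zero_grad()
gen_loss().backward()
optimizer.step()
d_loss = slm_disc_loss()
if d_loss != 0:
    optimizer.zero_grad()
    d_loss.backward()
    optimizer.step()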

@yl4579 (Owner) commented Nov 24, 2023

Can you please make the change and test that it doesn’t cause any problems in your setup?

@kmn1024 (Contributor, Author) commented Nov 24, 2023

Ah, another good catch. Trying...

This issue also seems to be in train_second.py?

@yl4579 (Owner) commented Nov 24, 2023

Yes, so if you could verify it has no problems running on your system, I’ll change that too. I couldn’t reproduce it in my environment.

yl4579 added a commit that referenced this pull request Nov 24, 2023:
Continued fix of SLM training (see #74)
nawed2611 pushed a commit to team-listnr/StyleTTS2 that referenced this pull request Feb 8, 2024