
TTS Finetune / Finetuning TTS3 on multi-speaker data #2442

@dc3ea9f

Description


Hello, I ran into a problem while using examples/other/tts_finetune/tts3 (commit_id 863609) to finetune on my own dataset:

The example only provides a tutorial for finetuning on the csmsc_mini single-speaker dataset; it is not usable for finetuning on a multi-speaker dataset.

To finetune on a multi-speaker dataset, I tried to obtain phoneme durations via MFA alignment, but when running ./tools/montreal-forced-aligner/bin/mfa_align, some files failed to produce TextGrid results. The log shows:

WARNING (gmm-align-compiled[5.4.247~1-2148]:main():gmm-align-compiled.cc:103) No features for utterance 000xxx

I then tried the latest MFA (from conda) together with the dictionary and acoustic model provided in the repo to extract phoneme durations, and results were generated normally. However, after training for some steps, a dimension-mismatch error occurs. Is there a problem with my workflow? Why does this bug appear during training, and what should I change so that training runs normally?
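As a sanity check on my side (my own sketch, not part of the PaddleSpeech repo; the names `durations_by_utt` and `n_frames_by_utt` are illustrative), I can verify per utterance that the total number of frames implied by the MFA durations matches the length of the precomputed mel features, since a mismatch there would produce exactly the kind of shape error shown below:

```python
def find_mismatched_utterances(durations_by_utt, n_frames_by_utt):
    """Return (utt_id, summed_duration, n_frames) for every utterance whose
    summed phone durations (in frames) disagree with the length of the
    extracted acoustic features."""
    bad = []
    for utt_id, durs in durations_by_utt.items():
        total = sum(durs)
        n_frames = n_frames_by_utt.get(utt_id)
        if n_frames is None or total != n_frames:
            bad.append((utt_id, total, n_frames))
    return bad

# Hypothetical example: one utterance's durations sum to 281 frames while its
# mel-spectrogram has 289 frames -- the same 281 vs. 289 pair reported by the
# expand_v2 error in the traceback below.
mismatches = find_mismatched_utterances(
    {"000123": [100, 81, 100], "000456": [50, 50]},
    {"000123": 289, "000456": 100},
)
```

If this check flags utterances, the durations from the new MFA run were likely extracted with different audio preprocessing (e.g. hop size or trimming) than the cached features.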

multiple speaker fastspeech2!
spk_num: 174
samplers done!
dataloaders done!
vocab_size: 306
W0923 19:56:10.536396 43391 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.4, Runtime API Version: 10.1
W0923 19:56:10.542753 43391 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
model done!
optimizer done!
/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
INFO 2022-09-23 19:56:16,280 trainer.py:167]  iter: 96401/1200, Rank: 0, l1_loss: 1.856183, duration_loss: 0.316927, pitch_loss: 0.978752, energy_loss: 6.169956, loss: 9.321817, avg_reader_cost: 0.00075 sec, avg_batch_cost: 2.49490 sec, avg_samples: 32, avg_ips: 12.82615 sequences/sec
INFO 2022-09-23 19:56:16,432 trainer.py:167]  iter: 96402/1200, Rank: 0, l1_loss: 1.837183, duration_loss: 0.331705, pitch_loss: 2.075590, energy_loss: 5.443483, loss: 9.687962, avg_reader_cost: 0.00023 sec, avg_batch_cost: 0.15020 sec, avg_samples: 32, avg_ips: 213.05078 sequences/sec
INFO 2022-09-23 19:56:16,797 trainer.py:167]  iter: 96403/1200, Rank: 0, l1_loss: 1.924996, duration_loss: 0.372018, pitch_loss: 1.120224, energy_loss: 5.148971, loss: 8.566210, avg_reader_cost: 0.00028 sec, avg_batch_cost: 0.36255 sec, avg_samples: 32, avg_ips: 88.26467 sequences/sec
INFO 2022-09-23 19:56:17,012 trainer.py:167]  iter: 96404/1200, Rank: 0, l1_loss: 1.770270, duration_loss: 0.278578, pitch_loss: 0.868206, energy_loss: 4.675483, loss: 7.592536, avg_reader_cost: 0.00017 sec, avg_batch_cost: 0.21332 sec, avg_samples: 32, avg_ips: 150.00853 sequences/sec
INFO 2022-09-23 19:56:17,226 fastspeech2_updater.py:174] Evaluate: l1_loss: 2.153552, duration_loss: 0.385811, pitch_loss: 0.574073, energy_loss: 5.581622, loss: 8.695057
INFO 2022-09-23 19:56:20,039 trainer.py:167]  iter: 96405/1200, Rank: 0, l1_loss: 1.780190, duration_loss: 0.277861, pitch_loss: 1.635613, energy_loss: 5.039042, loss: 8.732705, avg_reader_cost: 0.24950 sec, avg_batch_cost: 0.51928 sec, avg_samples: 32, avg_ips: 61.62427 sequences/sec
INFO 2022-09-23 19:56:20,205 trainer.py:167]  iter: 96406/1200, Rank: 0, l1_loss: 1.688078, duration_loss: 0.244555, pitch_loss: 0.762871, energy_loss: 4.326387, loss: 7.021891, avg_reader_cost: 0.00062 sec, avg_batch_cost: 0.16330 sec, avg_samples: 32, avg_ips: 195.95860 sequences/sec
INFO 2022-09-23 19:56:20,473 trainer.py:167]  iter: 96407/1200, Rank: 0, l1_loss: 1.639309, duration_loss: 0.291534, pitch_loss: 0.861395, energy_loss: 4.975361, loss: 7.767599, avg_reader_cost: 0.00019 sec, avg_batch_cost: 0.26607 sec, avg_samples: 32, avg_ips: 120.26807 sequences/sec
INFO 2022-09-23 19:56:20,654 trainer.py:167]  iter: 96408/1200, Rank: 0, l1_loss: 1.648149, duration_loss: 0.320221, pitch_loss: 0.792124, energy_loss: 4.360466, loss: 7.120960, avg_reader_cost: 0.00018 sec, avg_batch_cost: 0.17950 sec, avg_samples: 32, avg_ips: 178.27264 sequences/sec
INFO 2022-09-23 19:56:20,908 fastspeech2_updater.py:174] Evaluate: l1_loss: 2.084182, duration_loss: 0.419054, pitch_loss: 0.366559, energy_loss: 4.915794, loss: 7.785589
INFO 2022-09-23 19:56:23,792 trainer.py:167]  iter: 96409/1200, Rank: 0, l1_loss: 1.634000, duration_loss: 0.245067, pitch_loss: 1.825069, energy_loss: 4.637261, loss: 8.341396, avg_reader_cost: 0.27610 sec, avg_batch_cost: 0.54012 sec, avg_samples: 32, avg_ips: 59.24561 sequences/sec
INFO 2022-09-23 19:56:24,009 trainer.py:167]  iter: 96410/1200, Rank: 0, l1_loss: 1.596452, duration_loss: 0.312945, pitch_loss: 0.717812, energy_loss: 3.500055, loss: 6.127264, avg_reader_cost: 0.00023 sec, avg_batch_cost: 0.21478 sec, avg_samples: 32, avg_ips: 148.99175 sequences/sec
INFO 2022-09-23 19:56:24,189 trainer.py:167]  iter: 96411/1200, Rank: 0, l1_loss: 1.579894, duration_loss: 0.246891, pitch_loss: 0.851606, energy_loss: 4.034445, loss: 6.712835, avg_reader_cost: 0.00030 sec, avg_batch_cost: 0.17803 sec, avg_samples: 32, avg_ips: 179.74425 sequences/sec
INFO 2022-09-23 19:56:24,458 trainer.py:167]  iter: 96412/1200, Rank: 0, l1_loss: 1.544847, duration_loss: 0.266166, pitch_loss: 0.645351, energy_loss: 4.618836, loss: 7.075200, avg_reader_cost: 0.00021 sec, avg_batch_cost: 0.26733 sec, avg_samples: 32, avg_ips: 119.70131 sequences/sec
INFO 2022-09-23 19:56:24,691 fastspeech2_updater.py:174] Evaluate: l1_loss: 2.006938, duration_loss: 0.450204, pitch_loss: 0.262614, energy_loss: 4.423265, loss: 7.143022
INFO 2022-09-23 19:56:27,653 trainer.py:167]  iter: 96413/1200, Rank: 0, l1_loss: 1.563614, duration_loss: 0.288662, pitch_loss: 1.887361, energy_loss: 3.617754, loss: 7.357391, avg_reader_cost: 0.27405 sec, avg_batch_cost: 0.53012 sec, avg_samples: 32, avg_ips: 60.36321 sequences/sec
INFO 2022-09-23 19:56:27,927 trainer.py:167]  iter: 96414/1200, Rank: 0, l1_loss: 1.547978, duration_loss: 0.275924, pitch_loss: 0.660862, energy_loss: 3.920330, loss: 6.405094, avg_reader_cost: 0.00029 sec, avg_batch_cost: 0.27138 sec, avg_samples: 32, avg_ips: 117.91451 sequences/sec
INFO 2022-09-23 19:56:28,144 trainer.py:167]  iter: 96415/1200, Rank: 0, l1_loss: 1.496017, duration_loss: 0.253219, pitch_loss: 0.574959, energy_loss: 3.712186, loss: 6.036382, avg_reader_cost: 0.00030 sec, avg_batch_cost: 0.21546 sec, avg_samples: 32, avg_ips: 148.51825 sequences/sec
INFO 2022-09-23 19:56:28,325 trainer.py:167]  iter: 96416/1200, Rank: 0, l1_loss: 1.476836, duration_loss: 0.215573, pitch_loss: 0.774607, energy_loss: 4.139956, loss: 6.606973, avg_reader_cost: 0.00029 sec, avg_batch_cost: 0.17925 sec, avg_samples: 32, avg_ips: 178.51734 sequences/sec
INFO 2022-09-23 19:56:28,529 fastspeech2_updater.py:174] Evaluate: l1_loss: 1.925813, duration_loss: 0.436651, pitch_loss: 0.225853, energy_loss: 4.057889, loss: 6.646207
INFO 2022-09-23 19:56:31,268 trainer.py:167]  iter: 96417/1200, Rank: 0, l1_loss: 1.507729, duration_loss: 0.278215, pitch_loss: 0.658901, energy_loss: 3.469983, loss: 5.914828, avg_reader_cost: 0.28339 sec, avg_batch_cost: 0.48521 sec, avg_samples: 32, avg_ips: 65.95148 sequences/sec
INFO 2022-09-23 19:56:31,511 trainer.py:167]  iter: 96418/1200, Rank: 0, l1_loss: 1.498667, duration_loss: 0.226933, pitch_loss: 0.807640, energy_loss: 3.796876, loss: 6.330114, avg_reader_cost: 0.00028 sec, avg_batch_cost: 0.24016 sec, avg_samples: 32, avg_ips: 133.24319 sequences/sec
Exception in main training loop: (InvalidArgument) The value (281) of the non-singleton dimension does not match the corresponding value (289) in shape for expand_v2 op.
  [Hint: Expected vec_in_dims[i] == expand_shape[i], but received vec_in_dims[i]:281 != expand_shape[i]:289.] (at /paddle/paddle/phi/kernels/impl/expand_kernel_impl.h:61)
  [operator < expand_v2 > error]
Traceback (most recent call last):
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/training/trainer.py", line 149, in run
    update()
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/training/updaters/standard_updater.py", line 110, in update
    self.update_core(batch)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2_updater.py", line 63, in update_core
    before_outs, after_outs, d_outs, p_outs, e_outs, ys, olens = self.model(
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2.py", line 550, in forward
    before_outs, after_outs, d_outs, p_outs, e_outs = self._forward(
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2.py", line 667, in _forward
    zs, _ = self.decoder(hs, h_masks)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/encoder.py", line 409, in forward
    xs, masks = self.encoders(xs, masks)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/repeat.py", line 25, in forward
    args = m(*args)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/encoder_layer.py", line 99, in forward
    x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/attention.py", line 144, in forward
    return self.forward_attention(v, scores, mask)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/attention.py", line 107, in forward_attention
    scores = masked_fill(scores, mask, min_value)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/masked_fill.py", line 44, in masked_fill
    mask = mask.broadcast_to(bshape)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 1917, in broadcast_to
    return _C_ops.expand_v2(x, 'shape', shape)
Trainer extensions will try to handle the extension. Then all extensions will finalize.Traceback (most recent call last):
  File "/home/xxxx/icassp_workspace/PaddleSpeech/examples/other/tts_finetune/tts3/local/finetune.py", line 269, in <module>
    train_sp(train_args, config)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/examples/other/tts_finetune/tts3/local/finetune.py", line 202, in train_sp
    trainer.run()
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/training/trainer.py", line 198, in run
    six.reraise(*exc_info)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/training/trainer.py", line 149, in run
    update()
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/training/updaters/standard_updater.py", line 110, in update
    self.update_core(batch)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2_updater.py", line 63, in update_core
    before_outs, after_outs, d_outs, p_outs, e_outs, ys, olens = self.model(
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2.py", line 550, in forward
    before_outs, after_outs, d_outs, p_outs, e_outs = self._forward(
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/models/fastspeech2/fastspeech2.py", line 667, in _forward
    zs, _ = self.decoder(hs, h_masks)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/encoder.py", line 409, in forward
    xs, masks = self.encoders(xs, masks)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/repeat.py", line 25, in forward
    args = m(*args)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/encoder_layer.py", line 99, in forward
    x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/attention.py", line 144, in forward
    return self.forward_attention(v, scores, mask)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/transformer/attention.py", line 107, in forward_attention
    scores = masked_fill(scores, mask, min_value)
  File "/home/xxxx/icassp_workspace/PaddleSpeech/paddlespeech/t2s/modules/masked_fill.py", line 44, in masked_fill
    mask = mask.broadcast_to(bshape)
  File "/home/xxxx/.custom/cuda-11.4.2-cudnn8-devel-ubuntu20.04-pytorch1.9.0_full_tensorboard/envs/paddle_env/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 1917, in broadcast_to
    return _C_ops.expand_v2(x, 'shape', shape)
ValueError: (InvalidArgument) The value (281) of the non-singleton dimension does not match the corresponding value (289) in shape for expand_v2 op.
  [Hint: Expected vec_in_dims[i] == expand_shape[i], but received vec_in_dims[i]:281 != expand_shape[i]:289.] (at /paddle/paddle/phi/kernels/impl/expand_kernel_impl.h:61)
  [operator < expand_v2 > error]
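For context, here is my understanding of the failing operation (illustrated with NumPy, not PaddleSpeech code): `broadcast_to` follows NumPy-style broadcasting rules, under which a non-singleton dimension must equal the target shape exactly, so a mask built from a 281-frame length cannot be expanded to match attention scores with 289 frames:

```python
import numpy as np

# Mask built from one sequence length (non-singleton last dim = 281).
mask = np.zeros((1, 1, 281), dtype=bool)

# Attempting to expand it to a 289-frame target fails, analogous to the
# expand_v2 error above: only size-1 dimensions may be broadcast.
try:
    np.broadcast_to(mask, (32, 2, 289))
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# With matching lengths the expansion succeeds.
ok_shape = np.broadcast_to(np.zeros((1, 1, 289)), (32, 2, 289)).shape
```

This is why I suspect the lengths stored in the metadata (from durations) and the actual feature lengths have diverged for some utterances.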
