Description
Describe the bug
Fix #3108 breaks tts_with_vc_to_file, at least with VITS.
See line 463 of TTS/api.py at commit 6fef4f9:
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name,speaker_wav=speaker_wav)
Changing the line from:
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name,speaker_wav=speaker_wav)
back to its pre-0.19.1 version:
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name)
solves the issue.
Please take a look at the script below for reproduction.
To Reproduce
Clone the Coqui TTS repository and install the dependencies as specified in the README file.
Then run the following script from the TTS root directory, replacing speaker_wav with any audio file you have at hand:
#!/usr/bin/env python3
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/pt/cv/vits").to(device)
tts.tts_with_vc_to_file(
    text="A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo",
    speaker_wav="test_audios/1693678335_24253176-processed.wav",
    file_path="test_audios/output.wav",
)
Expected behavior
The output audio file defined in file_path is generated, saying the sentence in text with the voice cloned from speaker_wav.
Logs
> tts_models/pt/cv/vits is already downloaded.
> Using model: vits
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:0
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:None
| > fft_size:1024
| > power:None
| > preemphasis:0.0
| > griffin_lim_iters:None
| > signal_norm:None
| > symmetric_norm:None
| > mel_fmin:0
| > mel_fmax:None
| > pitch_fmin:None
| > pitch_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
> initialization of speaker-embedding layers.
> initialization of language-embedding layers.
/home/probst/.pyenv/versions/coqui-tts/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
> Text splitted to sentences.
['A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo']
Traceback (most recent call last):
File "/home/probst/Projects/TTS-iara/./test.py", line 15, in <module>
tts.tts_with_vc_to_file(
File "/home/probst/Projects/TTS-iara/TTS/api.py", line 488, in tts_with_vc_to_file
wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/probst/Projects/TTS-iara/TTS/api.py", line 463, in tts_with_vc
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name, speaker_wav=speaker_wav)
File "/home/probst/Projects/TTS-iara/TTS/api.py", line 403, in tts_to_file
wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/probst/Projects/TTS-iara/TTS/api.py", line 341, in tts
wav = self.synthesizer.tts(
^^^^^^^^^^^^^^^^^^^^^
File "/home/probst/Projects/TTS-iara/TTS/utils/synthesizer.py", line 362, in tts
speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(speaker_wav)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 365, in compute_embedding_from_clip
embedding = _compute(wav_file)
^^^^^^^^^^^^^^^^^^
File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 342, in _compute
waveform = self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'load_wav'
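The traceback above suggests that once speaker_wav reaches the synthesizer, it tries to compute a speaker embedding through an encoder audio processor (encoder_ap) that is None for this model. The following is a minimal sketch of that failure mode using stand-in classes only. StubSpeakerManager and synthesize are hypothetical names, not the real TTS classes, and the assumption that the embedding is computed only when speaker_wav is given is inferred from the traceback rather than confirmed against the library source.

```python
class StubSpeakerManager:
    """Hypothetical stand-in for the speaker manager; not the real TTS class."""

    def __init__(self, encoder_ap=None):
        # Assumption: no speaker encoder is loaded, so the audio processor is None.
        self.encoder_ap = encoder_ap

    def compute_embedding_from_clip(self, wav_file):
        # Mirrors the call shown at the bottom of the traceback.
        return self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)


def synthesize(text, speaker_manager, speaker_wav=None):
    """Hypothetical stand-in for the synthesis step."""
    embedding = None
    if speaker_wav is not None:
        # Assumed branch: an embedding is only computed when a clip is passed.
        embedding = speaker_manager.compute_embedding_from_clip(speaker_wav)
    return {"text": text, "embedding": embedding}


manager = StubSpeakerManager(encoder_ap=None)

# Passing speaker_wav reaches the missing encoder and fails.
try:
    synthesize("hello", manager, speaker_wav="reference.wav")
    failed = False
    error_message = ""
except AttributeError as err:
    failed = True
    error_message = str(err)

# Omitting speaker_wav, as in the pre-0.19.1 call, skips the encoder entirely.
result = synthesize("hello", manager)
```

Under these assumptions, the first call fails the same way as the reported run, while the second call completes without touching the encoder. This is consistent with the observation that dropping speaker_wav from the tts_to_file call avoids the error.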
Environment
- 🐸TTS Version: 0.19.1
- PyTorch Version: 2.1.0+cu121
- OS: Artix Linux
Not using GPU.
Installed everything through pip in a virtual environment created with pyenv.