[Bug] AttributeError: 'NoneType' object has no attribute 'load_wav' when using tts_with_vc_to_file #3143

@pprobst

Describe the bug

The fix for #3108 breaks tts_with_vc_to_file, at least with VITS.

See:

TTS/TTS/api.py

Line 463 in 6fef4f9

self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name,speaker_wav=speaker_wav)

Changing the line from:
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name,speaker_wav=speaker_wav)

to its pre-0.19.1 version:
self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name)

solves the issue.

Please take a look at the script below for reproduction.

To Reproduce

Clone the Coqui TTS repository and install the dependencies as specified in the README file.
Then run the following script from TTS's root directory, replacing the speaker_wav value with the path to any audio file you have at hand:

#!/usr/bin/env python3

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

tts = TTS("tts_models/pt/cv/vits").to(device)

tts.tts_with_vc_to_file(
    text="A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo",
    speaker_wav="test_audios/1693678335_24253176-processed.wav",
    file_path="test_audios/output.wav",
)

Expected behavior

The output audio file defined in file_path is generated, saying the sentence in text with the voice cloned from speaker_wav.

Logs

> tts_models/pt/cv/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > initialization of language-embedding layers.
/home/probst/.pyenv/versions/coqui-tts/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 > Text splitted to sentences.
['A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo']
Traceback (most recent call last):
  File "/home/probst/Projects/TTS-iara/./test.py", line 15, in <module>
    tts.tts_with_vc_to_file(
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 488, in tts_with_vc_to_file
    wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 463, in tts_with_vc
    self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name, speaker_wav=speaker_wav)
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 403, in tts_to_file
    wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 341, in tts
    wav = self.synthesizer.tts(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/utils/synthesizer.py", line 362, in tts
    speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(speaker_wav)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 365, in compute_embedding_from_clip
    embedding = _compute(wav_file)
                ^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 342, in _compute
    waveform = self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)
               ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'load_wav'
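
For illustration, here is a minimal stand-alone sketch of the failure mode in the traceback above. It assumes that the speaker manager of this model has no speaker encoder loaded (so encoder_ap is None) and that the embedding is computed only when speaker_wav is passed down. StubSpeakerManager and synthesize are hypothetical stand-ins that mirror the call chain, not the actual TTS code:

```python
# Hypothetical stand-ins mirroring the call chain in the traceback.
# This is NOT the TTS source; names and structure are assumptions for illustration.


class StubSpeakerManager:
    """Speaker manager with no speaker encoder loaded (assumed state for this model)."""

    def __init__(self, encoder_ap=None):
        # None when no encoder audio processor was set up.
        self.encoder_ap = encoder_ap

    def compute_embedding_from_clip(self, wav_file):
        # Same access pattern as the failing line: raises if encoder_ap is None.
        return self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)


def synthesize(speaker_manager, text, speaker_wav=None):
    """Stand-in for the synthesis step (assumed behaviour):
    an embedding is computed only when speaker_wav is given."""
    embedding = None
    if speaker_wav is not None:
        embedding = speaker_manager.compute_embedding_from_clip(speaker_wav)
    return {"text": text, "embedding": embedding}


manager = StubSpeakerManager(encoder_ap=None)

# Without speaker_wav (like the pre-0.19.1 call), the encoder is never touched.
result = synthesize(manager, "hello")

# With speaker_wav forwarded (like the 0.19.1 call), the AttributeError is raised.
try:
    synthesize(manager, "hello", speaker_wav="clip.wav")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

If that assumption holds, the crash depends only on whether speaker_wav reaches the synthesis step, which would match the observation that dropping the argument in tts_with_vc avoids the error.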

Environment

- 🐸TTS Version: 0.19.1
- PyTorch Version: 2.1.0+cu121
- OS: Artix Linux

Not using GPU.
Installed everything through pip in a virtual environment created with pyenv.
