
[Bug] Custom model inference error [Unresolved] #3291

@78Alpha

Description

Describe the bug

Inference with a custom model fails for several reasons (unsupported language, inability to synthesize audio, unexpected path handling, JSON errors).

To Reproduce

1.) Fine-tune or train a model on the LJSpeech dataset
2.) Run: tts --text "Text for TTS" --model_path path/to/model --config_path path/to/config.json --out_path speech.wav --language en (a Python-API equivalent is sketched below)
3.) Observe one of the errors [Language None is not supported. | raise TypeError("Invalid file: {0!r}".format(self.name))]
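
For reference, here is a rough Python-API equivalent of step 2. This is a sketch only: it assumes the high-level TTS wrapper from TTS 0.20.x, and the paths and reference clip are placeholders, not taken from this report:

# Sketch of what step 2 does via the Python API (TTS 0.20.x).
# Paths and the reference clip are placeholders.
from TTS.api import TTS

tts = TTS(
    model_path="path/to/model",         # fine-tuned checkpoint directory
    config_path="path/to/config.json",  # matching config
).to("cuda")

# XTTS needs both a language and a reference clip; the CLI runs in the
# logs below appear to lose them on the way to the model.
tts.tts_to_file(
    text="Text for TTS",
    language="en",
    speaker_wav="path/to/reference.wav",
    file_path="speech.wav",
)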

Expected behavior

Produces a voice file with which to evaluate the model.

Logs

(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language en
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 400, in inference_with_config
    "zh-cn" if language == "zh" else language in self.config.languages
AssertionError:  ❗ Language None is not supported. Supported languages are ['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru', 'nl', 'cs', 'ar', 'zh-cn', 'hu', 'ko', 'ja']
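
The assertion fires because language reaches Xtts.inference_with_config as None, even though --language en was passed on the command line. A minimal reproduction of the check, paraphrased from the expression shown in the traceback (xtts.py:400):

# language is None by the time it reaches the model, despite --language en.
language = None
supported = ['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru',
             'nl', 'cs', 'ar', 'zh-cn', 'hu', 'ko', 'ja']
# When language == "zh" the expression evaluates to the truthy string
# "zh-cn"; for everything else it is a membership test, so None fails it.
assert ("zh-cn" if language == "zh" else language in supported), \
    f" ❗ Language {language} is not supported. Supported languages are {supported}"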


(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language en
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 415, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 476, in full_inference
    (gpt_cond_latent, speaker_embedding) = self.get_conditioning_latents(
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 351, in get_conditioning_latents
    audio = load_audio(file_path, load_sr)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 72, in load_audio
    audio, lsr = torchaudio.load(audiopath)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 204, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile.py", line 27, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 1212, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: None
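
This second traceback is the same missing-argument problem one step later: ref_audio_path (the speaker_wav) reaches get_conditioning_latents as None, is handed to torchaudio.load, and soundfile rejects it. Supplying a reference clip on the command line should sidestep this particular failure, assuming the tts CLI's --speaker_wav flag is honored for a local --model_path (untested here):

tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ \
    --config_path ./tts_models/en/ljspeech/config.json \
    --out_path speech.wav --language en --speaker_wav path/to/reference.wav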

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.20.6",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.13",
        "version": "#1 SMP Thu Oct 5 21:02:42 UTC 2023"
    }
}

Additional context

The documentation shows two different ways to run inference with a custom model; neither worked.
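
For completeness, one of the documented paths is the low-level XTTS API; a minimal sketch of it (assuming TTS 0.20.x; the checkpoint directory and reference clip are placeholders):

# Hedged sketch of the low-level XTTS path from the TTS 0.20.x docs.
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("./tts_models/en/ljspeech/config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./tts_models/en/ljspeech/", eval=True)
model.cuda()

# synthesize() is the entry point seen in the tracebacks above; language
# and speaker_wav are passed explicitly here instead of arriving as None.
outputs = model.synthesize(
    "Text for TTS",
    config,
    speaker_wav="path/to/reference.wav",
    language="en",
)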
