
Conversation

tyrauber

Correct me if I am wrong, but the ggml fork is not required.

This PR migrates to the official ggml submodule and extracts the handful of TTS methods required into src/ggml-tts-ext.cpp.

Tested on the CLI with both variants of the Dia Q4 models you provided.

Now the library can keep current with ggml and benefit from recent changes.

This should help with performance and platform support.

Thank you very much for this library.

tyrauber added 3 commits May 26, 2025 14:50
  • …ayer - Replace mmwillet/ggml with official llama.cpp/ggml - Add TTS extension layer with missing functions - Increase GGML_MAX_NAME to 128 - Keep llama.cpp pristine - Successfully tested with Q4 models
  • …willet/ggml submodule - Update examples to link only to tts library - Fix STFT/ISTFT with proper inverse FFT - Implement all required TTS functions - Maintain llama.cpp pristine - Successfully tested with both models - Audio quality significantly improved
  • …l-org/ggml as submodule for latest GGML features - Update CMakeLists.txt to use standalone GGML instead of llama.cpp bundled version - Enable ARM-specific optimizations (DOTPROD, MATMUL_INT8, FMA, FP16_VECTOR_ARITHMETIC, SME) - Enable Metal backend for GPU acceleration on macOS - Enable BLAS support with Accelerate framework - Maintain TTS extension compatibility with standard GGML API - Successfully tested with both Dia_Q4.gguf and Dia_Q4_DAC_F16.gguf models - Clean build without duplicate library warnings
tyrauber marked this pull request as draft May 27, 2025 00:47

-[[noreturn]] void tts_abort(const char * file, int line, const char * fmt, ...);
+[[noreturn]] void tts_abort(const char *file, int line, const char *fmt, ...);
@danielzgtg
Collaborator

Why was this spacing changed away from the ggml-org code style?

@tyrauber
Author

@danielzgtg, that's a good question. Wasn't intentional. I'll see if I can revert that.

@tyrauber
Author

Actually, it looks like this missed Metal support. I've marked this PR as a draft until that's resolved.

@danielzgtg
Collaborator

Thank you for your concern in allowing us to "keep current with ggml and benefit from recent changes."

We were originally planning to rebase on top of latest ggml. As there wasn't any pressing need to do so, I didn't do that yet. In the long run, I hope to upstream stuff like ggml_squeeze once the code is clean enough.

Soon (tm), I need the following patch in ggml to fix a linker error:

diff --cc src/ggml-hip/CMakeLists.txt
index fccf8eb,fccf8eb..11e269a
--- a/src/ggml-hip/CMakeLists.txt
+++ b/src/ggml-hip/CMakeLists.txt
@@@ -7,6 -7,6 +7,8 @@@ if (NOT EXISTS $ENV{ROCM_PATH}
  else()
      set(ROCM_PATH $ENV{ROCM_PATH})
  endif()
++add_link_options(-fPIE)
++add_compile_options(-fPIE)
  
  list(APPEND CMAKE_PREFIX_PATH  ${ROCM_PATH})
  list(APPEND CMAKE_PREFIX_PATH "${ROCM_PATH}/lib64/cmake")

Is there any way mmwillet/ggml can be updated? I'm somewhat open to merging this PR into TTS.cpp so we can have the small benefit of accessing my recently merged ggml backtrace PR. However, we will need to fork again if there is a need to add or modify HIP/CUDA code.

@tyrauber
Author

I am running into issues with the other models, specifically the Kokoro gguf linked in the README.

  • The forked mmwillet/ggml supports conv_transpose_1d operations with non-zero padding values.
  • The main ggml-org/ggml repository has a strict assertion, GGML_ASSERT(p0 == 0), that requires padding=0 for conv_transpose_1d operations. Kokoro models use conv_transpose_1d layers with padding=1 in their generator upsample layers (sketched below).
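
For concreteness, a minimal PyTorch sketch of the operation Kokoro relies on (the shapes here are illustrative, not taken from the model):

```python
import torch

# ConvTranspose1d with padding=1, as in Kokoro's generator upsample layers.
# Output length: (L_in - 1)*stride - 2*padding + kernel_size  (dilation=1, no output_padding)
x = torch.randn(1, 8, 100)                    # (batch, channels, length)
up = torch.nn.ConvTranspose1d(8, 4, kernel_size=4, stride=2, padding=1)
y = up(x)
print(y.shape)                                # torch.Size([1, 4, 200])
# ggml-org/ggml's ggml_conv_transpose_1d asserts p0 == 0, so this exact case is rejected there.
```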

I think py-gguf/tts_encoders/kokoro_gguf_encoder.py should be adjusted with a bias tensor adjustment that automatically resizes bias tensors when conv_transpose_1d padding changes from non-zero to zero, making Kokoro gguf models compatible with ggml main. Thoughts?

@danielzgtg
Collaborator

danielzgtg commented May 27, 2025

> other models. Specifically, the Kokoro gguf

At this point that's the main model and basically the only fast-enough model.

> ggml-org/ggml repository has a strict assertion

That assertion means "not yet implemented"; otherwise the function parameter p0 should be removed and inlined, because it would be a constant. This feature should be implemented upstream.

> kokoro_gguf_encoder.py should be adjusted with a bias tensor adjustment

I'd rather not break compatibility with existing files. We don't have versioning, and I can't unpublish other people's existing Hugging Face models that are linked in various places.

What won't break compatibility is this: we can do any fixup in the assign_weights functions of each model's .cpp. This would reduce the percentage available for a future mmap feature, but whatever. We're already doing transposes and other stride tricks. I'm also thinking of doing the "increment the gamma bias by one" fixup at load time like this, as sketched below.
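
As a sketch of the idea (a Python stand-in for brevity; the real fixups would live in each model's C++ assign_weights function, and assign_norm_gamma is a hypothetical name):

```python
import numpy as np

# Hypothetical load-time fixup: transform a tensor as it is assigned, so
# existing gguf files keep working and no re-encoding or versioning is needed.
def assign_norm_gamma(raw: np.ndarray) -> np.ndarray:
    # e.g. the "increment the gamma bias by one" adjustment, applied at load time
    return raw + 1.0
```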

@tyrauber
Author

100%. Dia is painfully slow. I absolutely agree about not breaking backwards compatibility. I am trying to figure out a workaround in ggml-tts-ext.h to handle this discrepancy without having to fork and patch ggml.

@mmwillet
Owner

@tyrauber Sorry for my absence. Currently, the ggml fork is required, as several major modifications were made to it in order to properly support various behaviors. I'll work to update the patch this week and update the documentation to reflect this.

@mmwillet
Owner

> am trying to figure out a workaround in ggml-tts-ext.h to handle this discrepancy without having to fork and patch ggml.

One possible solution that avoids having to patch ggml entirely would be to reimplement the added operations as custom map operations. I chose to avoid this initially because the custom map function is limited, has historically been bug-prone, and is not compatible with acceleration. Let me see if I can hack together a working workaround this week.

@tyrauber
Author

Hey @mmwillet, no worries. Don't mean to create a ton of unnecessary work. Just fascinated by all this and trying to understand.

Is there any documentation on the specific requirements that necessitated the ggml fork? I'm trying to understand what functionality is required for TTS that isn't provided in standard ggml.

From examining the fork's changes, it appears the main modifications are:

  • Conv transpose 1D with non-zero padding - Standard ggml's conv_transpose_1d enforces p0 == 0, but the fork removes this constraint
  • Custom STFT/ISTFT implementations - Audio processing operations for magnitude/phase handling
  • Linear upscaling for 1D sequences - Different from standard nearest-neighbor upscaling
  • Metal acceleration support for these custom operations

Are these modifications primarily due to:

  • Specific tensor layout requirements in existing model checkpoints?
  • Audio processing needs that standard ggml doesn't address?
  • Performance optimizations for TTS workloads?
  • Model encoding choices?

From my experience with this branch, working around conv transpose 1D with non-zero padding is a huge pain in the ass. I tried to code around the issue and ended up with 600 lines of custom functions, which is almost as bad as a fork.

I really think it is important to identify exactly what ggml needs to support TTS and get that merged upstream.

@mmwillet
Owner

mmwillet commented May 28, 2025

@tyrauber

> Is there any documentation on the specific requirements that necessitated the ggml fork? I'm trying to understand what functionality is required for TTS that isn't provided in standard ggml.

No, I have not put together any documentation on this outside of scattered tickets and/or inline comments. I will do so in the primary README later today.

In terms of what the modifications are, I think you mostly outlined them, but I'll reiterate and amend here for posterity:

> • Conv transpose 1D with non-zero padding - Standard ggml's conv_transpose_1d enforces p0 == 0, but the fork removes this constraint

As you noted, the biggest gap stems from the native GGML convolution transposition implementation, which doesn't have full parity with TensorFlow or PyTorch. Besides not supporting non-zero padding or output padding, it also doesn't support grouping, which is used by Kokoro (though the support that I added for grouping is still somewhat incomplete).
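
As a sketch of that parity gap (sizes are illustrative), PyTorch's ConvTranspose1d exposes output_padding and groups in addition to padding, none of which upstream ggml's conv_transpose_1d currently honors:

```python
import torch

x = torch.randn(1, 16, 50)
up = torch.nn.ConvTranspose1d(
    in_channels=16, out_channels=16,
    kernel_size=7, stride=3,
    padding=2, output_padding=1,
    groups=4,                     # channels processed as 4 independent groups
)
y = up(x)
print(y.shape)                    # torch.Size([1, 16, 151])
```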

> • Custom STFT/ISTFT implementations - Audio processing operations for magnitude/phase handling

To a certain degree, it is not necessary to support these operations as part of the graph (as I recall, they are not implemented as part of the graph in llama.cpp), and it should be possible to support them in their current state via GGML's custom_map ops. Accelerating these operations would, however, require that they be independently implemented as their own GGML ops (e.g. with good threading support and shader support for Metal, CUDA, etc.).
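
For reference, a sketch of the semantics such custom ops would need to reproduce, shown with PyTorch (the parameter values are just typical ones, not taken from any of the models here):

```python
import torch

x = torch.randn(16000)                              # 1 s of 16 kHz audio
win = torch.hann_window(1024)
spec = torch.stft(x, n_fft=1024, hop_length=256,
                  window=win, return_complex=True)  # (n_fft/2 + 1, n_frames), complex
y = torch.istft(spec, n_fft=1024, hop_length=256, window=win)  # back to ~16000 samples
```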

> • Linear upscaling for 1D sequences - Different from standard nearest-neighbor upscaling

This could currently be a custom map operation, and it is unlikely that implementing it as such would result in any significant performance degradation, even when accelerating the model via CUDA/Metal ops (it is so fast that it can just be piped to the CPU with little time loss). My implementation of it is also sloppy. For better parity with PyTorch or TensorFlow, it really should be implemented as a passable option on the generic upscaling operation (along with support for 'bilinear', 'bicubic', 'trilinear', 'area', and 'nearest-exact' modes) rather than as its own operation.
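
A sketch of what such a mode option would mirror, using PyTorch's interpolate (sizes illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 100)                            # (batch, channels, length)
nearest = F.interpolate(x, scale_factor=2.0, mode='nearest')   # what ggml_upscale does today
linear = F.interpolate(x, scale_factor=2.0, mode='linear',
                       align_corners=False)           # what the fork's ggml_upscale_linear approximates
```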

> • Metal acceleration support for these custom operations

This, along with CUDA and other shader-like acceleration protocols, was the main long-term motivator for implementing these as native (rather than custom) operations, but I have failed to make the headway that I expected to.

> • Specific tensor layout requirements in existing model checkpoints?

This is, at least at face value, why these operations exist in some form. In order for the GGML graph to have parity with a PyTorch implementation, it must support the same underlying behaviors. GGML currently doesn't support these out of the box, so I added them. As I noted, though, some of these operations could be moved into custom operations or performed via inline transformations; the main reason not to do that is to support better acceleration.

> • Audio processing needs that standard ggml doesn't address?

GGML doesn't have good support for convolutional operations, which are heavily used in almost every TTS model. Additionally, the usage of FFT in TTS is another place where parity is missing, but this has also historically been the case for other tensor libraries (including TF and PyTorch), and custom operations could likely be built.

Something additional to note about the lack of convolutional parity: in my non-open-source work I've run into this exact issue with vision models on GGML, and I have built out bespoke code to properly support 2D convolutional operations for my own use. I wouldn't be surprised if, in the open-source space, you could find some folks working on vision who have been forced to implement similar functionality to ours.

> • Performance optimizations for TTS workloads?

Performance optimizations would mostly be on convolutional operations for the time being (which isn't TTS-specific per se). My vision would include better Metal and GPU support for FFT, but at this point I'd be happy to pipe that to the CPU.

> • Model encoding choices?

I think @danielzgtg noted that there are some model encoding choices that may have influenced how we chose to use GGML, but for the most part these things have resulted in minor TTS-side code changes. E.g., GGML doesn't have a simple way of performing inline operations against a scalar, so if we need to, for example, multiply a tensor by 2, we might simply add it to itself or perform that operation at gguf encoding time.

> I really think it is important to identify exactly what ggml needs to support TTS and get that merged upstream.

I agree, this would 100% be the best outcome.

@tyrauber
Author

@mmwillet Thank you for the comprehensive breakdown! This really clarifies the scope and nature of ggml's TTS requirements, and helps illustrate why the custom fork of ggml is currently a requirement.

To summarize, the core upstream ggml changes would be:

  1. Enhanced Conv Transpose 1D
    Remove the p0 == 0 constraint and add support for non-zero padding. Maintain backward compatibility (existing calls with p0=0 work unchanged). This is the most critical blocker and affects multiple model types. The lack of full PyTorch/TensorFlow parity in GGML's convolution operations seems like a significant gap that would benefit the broader ecosystem.

  2. Convolution Grouping Support
    Add grouping parameter to convolution operations. Required by Kokoro and common in modern CNN architectures. Extend existing conv operations with optional grouping parameter.

  3. Enhanced Upscaling Operation
    Extend ggml_upscale with multiple interpolation modes. Better PyTorch/TensorFlow parity; removes the need for a custom ggml_upscale_linear. Modes could include nearest (existing), linear (1D sequences), bilinear, bicubic, trilinear, area, and nearest-exact.

  4. Reciprocal Operation
    Add ggml_reciprocal as a standard operation. It is a common mathematical operation and simple to implement: a straightforward addition with Metal/CUDA acceleration.

My primary target for this library is mobile devices, so I was trying to squeeze as much performance and acceleration out of the library as possible. As I dug into the code, I ran into all these questions. So, I really appreciate you sharing your time, knowledge and experience.

I think the documentation you mentioned adding to the README would be incredibly valuable for the community! I really think if these changes could get upstreamed, the library, and the ecosystem as a whole, would benefit tremendously.

Happy to close this PR, since my original understanding and assessment were faulty. Hopefully my thoughts and questions were in some way helpful. Thanks again for this library!

@danielzgtg
Collaborator

> Custom STFT/ISTFT implementations

@mmwillet It would be nice if your implementation could be unified with the one in src/whisper.cpp and the result upstreamed to ggml.

> inline operations against a scalar

> ggml_reciprocal

Do stride tricks work? There's no speedup for multiplication by 2, but a reciprocal is 1/x. In numpy, the 1 is a 1-element n-dimensional array that can be broadcast by setting the dimensions to anything and all the strides to 0. We can see if this works in ggml.
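
A minimal numpy sketch of that zero-stride broadcast (whether ggml views can take zero strides is the open question):

```python
import numpy as np

x = np.random.rand(4, 3).astype(np.float32)
# View a single 1.0 as if it were x-shaped: every stride is 0, so no
# ones tensor is ever materialized in memory.
one = np.broadcast_to(np.float32(1.0), x.shape)
print(one.strides)             # (0, 0)
recip = one / x                # elementwise 1/x
assert np.allclose(recip, 1.0 / x)
```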

@mmwillet
Owner

@tyrauber Thanks for your input on this. I am going to close this PR for now, but happy to reopen if our approach changes and we need to return to it.

mmwillet closed this May 29, 2025
@tyrauber
Author

@mmwillet, I got a branch where the custom TTS ggml methods are extracted into an ext class, but if we are upstreaming (which is definitely the right approach) I don't think you need it. If it would be helpful, let me know.

When you get around to it, some documentation around the TTS ggml requirements would be really helpful. I still don't understand why TTS requires a non-zero-padding conv transpose. Why did the Dia model work without that? Is it because different TTS models have different training, layers, etc.? I am sure there is a reason, I just don't grok it yet. If you could braindump how ggml TTS works, and why it requires unconventional padding and grouping, I think it would be hugely beneficial and would help contributors get up to speed. Right now, it's a lot of trial and error, and printf.

Thank you! Fantastic library. Really appreciate your and @danielzgtg's responsiveness and willingness to discuss and share. Mad respect.

@tyrauber
Author

llama.rn just added TTS multi-modal support using OuteTTS. llama.rn is a React Native wrapper around ggml-org/llama.cpp, and llama.cpp uses a custom version of ggml. I think active development of ggml is happening there; the ggml repo mentions that "some of the development is currently happening in the llama.cpp and whisper.cpp repos."

@mmwillet, @danielzgtg, Have either of you checked to see if llama.cpp/ggml has the required TTS methods?

Perhaps, instead of switching to ggml main, it would make sense to switch to llama.cpp, where active development of ggml is happening?

@danielzgtg
Collaborator

@tyrauber
Author

Ha! Too funny! Got it. So those scripts are keeping whisper.cpp and llama.cpp up-to-date with ggml main. And you are one of the people leading that effort. Awesome. Thank you!

So that answers the question. Swapping to llama.cpp wouldn't change anything, because it is current with main. (It would have been nice if it were a submodule so this association was obvious; it was difficult to know if it was a fork, or what version it was.)

This leaves the same questions asked above, though. Without the upstream changes:

  • How was I able to get Dia to work using existing ggml methods?
  • How was llama.rn able to add support for TTS with OuteTTS?

Are the core upstream changes outlined above only needed for Kokoro and Parler?
