feat: migrate to ggml main #65
Conversation
…ayer
- Replace mmwillet/ggml with official llama.cpp/ggml
- Add TTS extension layer with missing functions
- Increase GGML_MAX_NAME to 128
- Keep llama.cpp pristine
- Successfully tested with Q4 models

…willet/ggml submodule
- Update examples to link only to tts library
- Fix STFT/ISTFT with proper inverse FFT
- Implement all required TTS functions
- Maintain llama.cpp pristine
- Successfully tested with both models
- Audio quality significantly improved

…l-org/ggml as submodule for latest GGML features
- Update CMakeLists.txt to use standalone GGML instead of llama.cpp bundled version
- Enable ARM-specific optimizations (DOTPROD, MATMUL_INT8, FMA, FP16_VECTOR_ARITHMETIC, SME)
- Enable Metal backend for GPU acceleration on macOS
- Enable BLAS support with Accelerate framework
- Maintain TTS extension compatibility with standard GGML API
- Successfully tested with both Dia_Q4.gguf and Dia_Q4_DAC_F16.gguf models
- Clean build without duplicate library warnings
```diff
-[[noreturn]] void tts_abort(const char * file, int line, const char * fmt, ...);
+[[noreturn]] void tts_abort(const char *file, int line, const char *fmt, ...);
```
Why was this spacing changed away from the ggml-org code style?
@danielzgtg, that's a good question. Wasn't intentional. I'll see if I can revert that.
Actually, it looks like this missed Metal support. Marked this PR as a draft until resolved.
Thank you for your concern in allowing us to "keep current with ggml and benefit from recent changes." We were originally planning to rebase on top of the latest ggml; since there wasn't any pressing need to do so, I haven't done that yet. In the long run, I hope to upstream stuff like this.

Soon (tm), I need the following patch in ggml to fix a linker error:

```diff
diff --cc src/ggml-hip/CMakeLists.txt
index fccf8eb,fccf8eb..11e269a
--- a/src/ggml-hip/CMakeLists.txt
+++ b/src/ggml-hip/CMakeLists.txt
@@@ -7,6 -7,6 +7,8 @@@ if (NOT EXISTS $ENV{ROCM_PATH}
  else()
      set(ROCM_PATH $ENV{ROCM_PATH})
  endif()
++add_link_options(-fPIE)
++add_compile_options(-fPIE)
  list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH})
  list(APPEND CMAKE_PREFIX_PATH "${ROCM_PATH}/lib64/cmake")
```

Is there any way mmwillet/ggml can be updated? I'm somewhat open to merging this PR into TTS.cpp so we can have the small benefit of accessing my recently merged ggml backtrace PR. However, we will need to fork again if there is a need to add or modify HIP/CUDA code.
I am running into issues with the other models, specifically the Kokoro gguf linked in the README. The forked mmwillet/ggml supports conv_transpose_1d operations with non-zero padding values. I think py-gguf/tts_encoders/kokoro_gguf_encoder.py should be adjusted with a bias tensor adjustment that automatically resizes bias tensors when conv_transpose_1d padding changes from non-zero to zero, making Kokoro gguf models compatible with ggml main. Thoughts?
At this point that's the main model and basically the only fast-enough model.
That assertion means "not yet implemented", or else the function parameter
I'd rather not break compatibility with existing files. We don't have versioning, and I can't unpublish other people's existing Hugging Face models that are linked in various places. What won't break compatibility is this: we can do any fixup in the
100%. Dia is painfully slow. I absolutely agree regarding not breaking backwards compatibility. I am trying to figure out a workaround in ggml-tts-ext.h to handle this discrepancy without having to fork and patch ggml.
@tyrauber Sorry for my absence. Currently, the ggml fork is required, as several major modifications were made in order to properly support various behaviors. I'll work to upgrade the patch this week and update the documentation to reflect this.
One possible solution that would avoid having to patch ggml at all would be to reimplement the added operations as custom map operations. I chose to avoid this initially because the custom map function is limited, has historically been bug prone, and is not compatible with acceleration. Let me see if I can hack together a working workaround this week.
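For reference, here is a minimal, hypothetical sketch of what wrapping an elementwise op as a ggml custom map operation looks like against the current `ggml_map_custom1` API. The names `tts_ext_recip` and `build_recip` are illustrative only (not existing TTS.cpp code), and the example assumes contiguous F32 tensors:

```cpp
#include "ggml.h"

#include <cassert>
#include <cstdint>

// Callback executed on the CPU during graph compute; work is split across nth threads.
static void tts_ext_recip(struct ggml_tensor * dst, const struct ggml_tensor * src,
                          int ith, int nth, void * userdata) {
    (void) userdata;
    assert(src->type == GGML_TYPE_F32 && ggml_is_contiguous(src));

    const int64_t n  = ggml_nelements(src);
    const int64_t dr = (n + nth - 1) / nth;          // elements per thread
    const int64_t i0 = dr * ith;
    const int64_t i1 = i0 + dr < n ? i0 + dr : n;

    const float * x = (const float *) src->data;
    float       * y = (float *)       dst->data;
    for (int64_t i = i0; i < i1; ++i) {
        y[i] = 1.0f / x[i];                          // elementwise reciprocal
    }
}

// Inserted into the graph like any other node; the callback runs at compute time.
static struct ggml_tensor * build_recip(struct ggml_context * ctx, struct ggml_tensor * x) {
    return ggml_map_custom1(ctx, x, tts_ext_recip, GGML_N_TASKS_MAX, /*userdata*/ nullptr);
}
```

The trade-off described above applies directly: the callback always executes on the CPU, so when the rest of the graph runs on Metal or CUDA the data has to round-trip through host memory.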
Hey @mmwillet, no worries. Don't mean to create a ton of unnecessary work. Just fascinated by all this and trying to understand. Is there any documentation on the specific requirements that necessitated the ggml fork? I'm trying to understand what functionality is required for TTS that isn't provided in standard ggml. From examining the fork's changes, it appears the main modifications are:
Are these modifications primarily due to:
From my experience with this branch, working around conv_transpose_1d with non-zero padding is a huge pain in the ass. I tried to code around the issue and ended up with 600 lines of custom functions, which is almost as bad as a fork. I really think it is important to identify exactly what ggml needs in order to support TTS and get that merged upstream.
No, I have not put together any documentation on this outside of scattered tickets and/or inline comments. I will do so in the primary README later today. In terms of what the modifications are, I think you mostly outlined them, but I'll reiterate and amend here for posterity:
As you noted, the biggest gap stems from the native GGML transposed-convolution implementation, which doesn't have full parity with TensorFlow or PyTorch. Besides not supporting non-zero padding or output padding, it also doesn't support grouping, which is used by Kokoro (though the grouping support I added is still somewhat incomplete).
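(For readers following along: one way the non-zero-padding part of that gap can in principle be bridged without a fork is to run the stock zero-padding transposed convolution and crop `padding` frames from each end of the result, since that is how ConvTranspose1d defines its output length. The sketch below is a rough illustration against the public ggml API; the helper name `conv_transpose_1d_padded` is hypothetical, it assumes F32 tensors, and it does not cover output padding or grouping.)

```cpp
#include "ggml.h"

// Hypothetical helper (not TTS.cpp or ggml API): PyTorch-style ConvTranspose1d
// with padding > 0, built from the stock zero-padding ggml op plus a crop.
static struct ggml_tensor * conv_transpose_1d_padded(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,   // [K, OC, IC]
        struct ggml_tensor  * input,    // [L, IC]
        int stride,
        int padding) {
    // Upstream ggml only supports padding == 0 (and dilation == 1) here.
    struct ggml_tensor * full = ggml_conv_transpose_1d(ctx, kernel, input, stride, /*p0*/ 0, /*d0*/ 1);

    if (padding == 0) {
        return full;
    }

    // ConvTranspose1d with padding = p is the padding = 0 result with p frames
    // cropped from each end of the time axis (dim 0).
    struct ggml_tensor * cropped = ggml_view_3d(
        ctx, full,
        full->ne[0] - 2*padding, full->ne[1], full->ne[2],
        full->nb[1], full->nb[2],
        (size_t) padding * full->nb[0]);

    // Make the strided view contiguous so downstream ops see a plain tensor.
    return ggml_cont(ctx, cropped);
}
```

Output padding would additionally require concatenating zero frames on the right-hand end, and grouping would still need either per-group slicing or a native kernel.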
To a certain degree, it is not necessary to support these operations as part of the graph (as I recall they are not implemented as part of the graph in llama.cpp) and it should be possible to support these operations in their current state via GGML's custom_map ops. Acceleration of these operations would however require that they are independently implemented as their own GGML ops (e.g. good threading support and shader support for metal, cuda, etc).
This could currently be a custom map operation, and it is unlikely that implementing it as such would result in any significant performance degradation even when accelerating the model via CUDA/Metal ops (it is so fast that it can just be piped to the CPU with little time loss). My implementation of it is also sloppy. For better parity with PyTorch or TensorFlow, it really should be implemented as an option passed to the generic upscaling operation (along with support for 'bilinear', 'bicubic', 'trilinear', 'area', and 'nearest-exact' modes as well) rather than as its own operation.
This, together with CUDA and other shader-like acceleration protocols, was the main long-term motivator for implementing these as native (rather than custom) operations, but I have failed to make the headway that I expected to.
This is, at least at face value, why these operations exist in some form. In order for the GGML graph to have parity with a PyTorch implementation, it must support the same underlying behaviors. GGML currently doesn't support these out of the box, so I added them. As I noted, though, some of these operations could be moved into custom operations or performed via inline transformations; the main reason not to do that is to support better acceleration.
GGML doesn't have good support for convolutional operations, which are heavily used in almost every TTS model. Additionally, the usage of FFT in TTS is another place where parity is missing, but this has also historically been the case for other tensor libraries (including TF and PyTorch), and custom operations could likely be built. Something additional to note about the lack of convolutional parity is that in my non-OS work I've run into this exact issue with vision models on GGML and have built out bespoke stuff to properly support 2D convolutional operations for my own personal use. I wouldn't be surprised if in the OS space you could find some folks working on vision that have been forced to implement similar functionality to ours.
Performance optimizations would mostly be on convolutional operations for the time being (which isn't TTS-specific per se). My vision would include better Metal and GPU support for FFT, but at this point I'd be happy to pipe that to the CPU.
I think @danielzgtg noted that there are some model encoding choices that may have influenced how we chose to use GGML, but for the most part these things have resulted in minor TTS-side code changes. E.g., GGML doesn't have a simple way of performing inline operations against a scalar, so if we need to, for example, multiply a tensor by 2, we might simply add it against itself or perform that operation at gguf encoding time.
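(As a tiny illustration of the add-against-itself trick described above; purely a sketch, and note that current ggml does expose `ggml_scale` for a plain constant factor, so this mainly matters for scalar arithmetic beyond simple scaling.)

```cpp
#include "ggml.h"

// Double a tensor without a dedicated "multiply by scalar" node: y = x + x == 2 * x.
static struct ggml_tensor * times_two(struct ggml_context * ctx, struct ggml_tensor * x) {
    return ggml_add(ctx, x, x);
}
```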
I agree, this would 100% be the best outcome.
@mmwillet Thank you for the comprehensive breakdown! This really clarifies the scope and nature of the TTS requirements on ggml, and helps illustrate why the custom fork of ggml is currently a requirement. To summarize, the core upstream ggml changes would be:
My primary target for this library is mobile devices, so I was trying to squeeze as much performance and acceleration out of the library as possible. As I dug into the code, I ran into all these questions. So, I really appreciate you sharing your time, knowledge and experience. I think the documentation you mentioned adding to the README would be incredibly valuable for the community! I really think if these changes could get upstreamed, the library, and the ecosystem as a whole, would benefit tremendously. Happy to close this PR, since my original understanding and assessment was faulty. Hopefully my thoughts and questions were in some way helpful. Thanks again for this library!
@mmwillet It would be nice if your implementation could be unified with the one in src/whisper.cpp and the result upstreamed to ggml.
Do stride tricks work? There's no speedup for multiplication by 2, but a reciprocal is |
@tyrauber Thanks for your input on this. I am going to close this PR for now, but happy to reopen if our approach changes and we need to return to it.
@mmwillet, I got a branch where the custom TTS ggml methods are extracted into an ext class, but if we are upstreaming (which is definitely the right approach) I don't think you need it. If it would be helpful, let me know. When you get around to it, some documentation around the TTS ggml requirements would be really helpful. I still don't understand why TTS requires a non-zero-padding conv. Why did the Dia model work without that? Is that because different TTS models have different training, layers, etc.? I am sure there is a reason, I just don't grok it yet. If you could braindump how ggml TTS works, and why it requires unconventional padding and grouping, I think it would be hugely beneficial and help contributors get up to speed. Right now, it's a lot of trial and error, and printf. Thank you! Fantastic library. Really appreciate your and @danielzgtg's responsiveness and willingness to discuss and share. Mad respect.
llama.rn just added TTS multi-modal support using OuteTTS. llama.rn is a React Native wrapper around ggml-org/llama.cpp, and llama.cpp is using a custom version of ggml. I think active development of ggml is happening there. The ggml repo mentions that "some of the development is currently happening in the llama.cpp and whisper.cpp repos." @mmwillet, @danielzgtg, have either of you checked to see if llama.cpp/ggml has the required TTS methods? Perhaps, instead of switching to ggml main, it would make sense to switch to llama.cpp, where active development of ggml is happening?
Ha! Too funny! Got it. So those scripts keep whisper.cpp and llama.cpp up to date with ggml main, and you are one of the people leading that effort. Awesome, thank you! So that answers the question: swapping to llama.cpp wouldn't change anything, because it is current with main. (It would have been nice if it were a submodule so this association was obvious; it was difficult to know whether it was a fork, or what version it was.) This leaves the same questions asked above, though. Without the upstream changes:
Are these core upstream changes outlined above only needed for Kokoro and Parler? |
Correct me if I am wrong, but the ggml fork is not required.
This PR migrates to the official ggml submodule and extracts the handful of TTS methods required into src/ggml-tts-ext.cpp.
Tested on the CLI with both variants of the Dia Q4 models you provided.
Now the library can keep current with ggml and benefit from recent changes.
Should help with performance and platform support.
Thank you very much for this library.