NEON implementation for trellis quants #471
Merged
Alternative to #460
One wouldn't really want to use this on a NEON CPU as it is much too slow. But for the sake of completeness, here it is.
Sweep-bench results for LLaMA-3.1-8B-Instruct with BLAS on an M2-Max CPU (PP performance is much lower without BLAS) cover IQ2_KT, IQ3_KT, and IQ4_KT.
This is nevertheless quite a bit faster than #460, so I'll go with this PR.
Of note: I couldn't make IQ4_KT work with fp16 arithmetic for some reason. Not sure if there really is fp16 range overflow, or if I just have a bug in the fp16 implementation that I simply cannot see.