Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?
A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include it as a git submodule rather than make it a "git clone and apply patch" step (rough sketch below).
A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependence on Homebrew Python.
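For illustration, the submodule route is only a few commands. A rough sketch, with the vendor/ path and the pinned commit as placeholders rather than anything taken from the actual repo:

    # hypothetical layout: vendor llama.cpp as a pinned submodule
    git submodule add https://github.com/ggml-org/llama.cpp vendor/llama.cpp
    git -C vendor/llama.cpp checkout <known-good-commit>   # placeholder revision
    git add .gitmodules vendor/llama.cpp
    git commit -m "Pin llama.cpp as a submodule"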
Aurornis
I finally had time to read the code. The patch is unnecessary because this functionality has been in llama.cpp since 2023 if I understand this PR correctly: https://github.com/ggml-org/llama.cpp/pull/4312
Instead of offering a forked llama.cpp with the changes applied as commits, the repo wants you to run an `install.sh` script which checks out the master branch of llama.cpp without specifying a revision, then applies a short patch to it. This alone should be a warning flag that something is amiss.
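Roughly what that flow boils down to, reconstructed from the description here (the patch filename comes from the repo, but whether the script uses `git apply` or `patch` is an assumption; this is a sketch, not the actual script):

    # approximate shape of the install flow being described -- not the real install.sh
    git clone https://github.com/ggerganov/llama.cpp   # old URL, no revision pinned
    cd llama.cpp
    git apply ../patch/fixed_kv_patch.diff             # short patch on top of whatever master is today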
There are 4 different patch files in the repo, plus an extra version of the patch embedded in the install script as a heredoc for some reason. The script also contains two different versions of the code to clone the repo and attempt the patch.
The install.sh script overwrites one of the patch files with another patch file with this line:
> cp patch/split_kv_quant.diff patch/fixed_kv_patch.diff
So the `fixed_kv_patch.diff` that is checked into the repo gets overwritten before being applied.
As far as I can tell, this is therefore the patch it's supposed to use: https://github.com/dipampaul17/KVSplit/blob/main/patch/split... (EDIT: I think it's actually this one, see comment at the end: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... )
The only thing it adds is a "--kvq" argument which is supposed to let you set K and V quantization at the same time, but immediately above it are the already built-in arguments for setting the K and V quantization separately. Surely the author must have noticed the functionality already existed at some point while shuffling these patches around?
I strongly recommend that people do not run shell scripts from new repos like this, especially when the shell script is so convoluted.
The HN post has 200+ upvotes and the GitHub repo has collected 200+ stars and climbing at this point, but I think the content is misleading. The flagged-to-death comment in this thread calling out the problem was actually correct. It's also concerning that the author continues to respond to this thread but is avoiding any questions about the functionality already existing.
EDIT: I misread the shell script. I think it actually applies this patch: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... After applying the patch it mysteriously overwrites the fixed_kv_patch.diff patch with the split_kv_quant.diff file but then does nothing with it. I don't know if this is the result of vibecoding or just someone carelessly editing code, but I'll reiterate that nobody should run shell scripts like this from unknown repos.
EDIT 2: I'm even more confused now. The install.sh script references the old URL for the llama.cpp repo ( https://github.com/ggerganov/llama.cpp ) which now redirects because it was changed some time ago. The patches attempt to modify arg parsing in common.cpp, but that code was moved to arg.cpp 8 months ago ( https://github.com/ggml-org/llama.cpp/commit/bfe76d4a17228bf... ). So this install script and repo appear to be based on code from ~2024 using options added to llama.cpp in ~2023. What is going on here?
behnamoh
Is this patch possible to do on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.
ondra
Is this any different from using --cache-type-k and --cache-type-v?
badmonster
I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?
entrepy123
Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?
I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.
So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.
nico
Great work. This seems very interesting, but I need something slightly more high-level to relate to it.
Will it just allow me to run, let's say, a model with a 2048-token context window at a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?
What’s the ideal use case for local models?
Thank you
3abiton
This is a brilliant idea and initiative. Does this also apply to GPUs? And I assume it should be compatible with other quantization techniques, though they'd probably require their own patches?
zmmmmm
Amazing!
Curious, what happens to performance? I assume you still pay the same performance price for longer context, even if you can now fit it in memory.
smcleod
+0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?
segmondy
You can do this already with -ctk and -ctv, so why would anyone need this?
    -ctk, --cache-type-k TYPE    KV cache data type for K
    -ctv, --cache-type-v TYPE    KV cache data type for V
Am I missing something? As far as I can see this patch does nothing except add new options that replicate the functionality of the existing --cache-type-k and --cache-type-v options.
Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0` is a very well known optimization to save VRAM.
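For anyone who wants to try the split-quantization idea with stock llama.cpp, a hedged example; the binary name, model path, and context size are placeholders, but the flags are the built-in ones discussed here:

    # K cache at 8-bit, V cache at 4-bit, using only upstream llama.cpp options
    ./llama-cli -m ./models/your-model.gguf -c 8192 \
        --flash-attn \
        --cache-type-k q8_0 \
        --cache-type-v q4_0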
And it's also very well known that the keys are more sensitive to quantization than values. E.g. https://arxiv.org/abs/2502.15075