llama.cpp fork unlocks quantized KV caches on dual GPUs, boosts generation 40% | UncensoredHub

A community fork of llama.cpp now supports quantized KV caches with tensor-split mode across multiple GPUs, delivering a 40% generation speed increase without quality loss on modest consumer hardware.

May 16, 2026

A GitHub fork of llama.cpp addresses a longstanding tensor parallelism bottleneck by enabling quantized KV caches across multiple GPUs for the first time. Released May 17, the branch delivers a roughly 40% increase in token generation speed when splitting workloads across two consumer cards: an RTX 3060 12GB paired with an RTX 4070 Super 12GB.

llama.cpp's --split-mode tensor flag has historically forced users into an unquantized KV cache, bloating VRAM usage and pushing many practitioners to skip tensor parallelism entirely. The new fork patches that limitation with minimal changes to mainline code, allowing Q8_0 KV quantization alongside tensor splitting. On Qwen3.5 27B Q4_K_M, the fork achieves 30 tokens per second in generation with tensor mode enabled, versus 21 tokens per second on a single GPU, an uplift of roughly 43%. Prompt processing drops slightly, from 582 to 545 tokens per second (about 6%), likely due to cross-GPU communication overhead, but the generation gain more than compensates.
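The percentages follow directly from the reported throughput figures. As a quick check, the short sketch below recomputes them; the helper name percent_change is ours, and the inputs are the tokens-per-second numbers quoted above.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Tokens-per-second figures reported for Qwen3.5 27B Q4_K_M.
generation = percent_change(21, 30)   # single GPU -> tensor split
prompt = percent_change(582, 545)     # prompt processing, same comparison

print(f"generation:        {generation:+.1f}%")   # +42.9%
print(f"prompt processing: {prompt:+.1f}%")       # -6.4%
```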

Real-world performance gains

The fork runs Qwen3.5 27B at batch size 128 with Q8_0 KV quantization on both key and value caches. In practical "write a story" workloads, the developer reports sustained speeds around 40 tokens per second, up from 25 on a single card. The branch also supports llama.cpp's latest draft-mtp speculative decoding flags, configured with --spec-draft-p-min 0.75 and --spec-draft-n-max 2 for further optimization.
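For readers who want to reproduce the setup, the sketch below assembles a launch command from the settings described above. It is an illustration under stated assumptions, not the developer's exact invocation: the binary name and model path are hypothetical, --cache-type-k, --cache-type-v and --batch-size are the mainline llama.cpp spellings and are assumed to be unchanged in the fork, and the source does not say whether "batch size 128" refers to the logical or the physical batch. The article also does not state how the draft-mtp mode itself is switched on, so only the two tuning flags it names are included.

```python
import shlex


def build_command(binary: str, model_path: str, n_batch: int = 128) -> list[str]:
    """Assemble an argument list from the settings reported in the article."""
    return [
        binary,
        "--model", model_path,
        "--split-mode", "tensor",         # tensor splitting across both GPUs
        "--cache-type-k", "q8_0",         # assumed mainline flag for the K cache type
        "--cache-type-v", "q8_0",         # assumed mainline flag for the V cache type
        "--batch-size", str(n_batch),     # assumption: "batch size 128" maps to this flag
        "--spec-draft-p-min", "0.75",     # speculative decoding flags as named in the article
        "--spec-draft-n-max", "2",
    ]


# Hypothetical binary name and model file, for illustration only.
cmd = build_command("./llama-server", "qwen3.5-27b-q4_k_m.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag paired with its value and avoids shell-quoting mistakes if the model path contains spaces.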

The code is available at github.com/RedToasty/llama.cpp_qts, branched from mainline as of May 17. No upstream pull request has been filed yet; the fork remains a community experiment. Practitioners with dual-GPU setups, especially 12GB + 12GB consumer pairs, stand to benefit most, as quantized caches remove the previous trade-off between tensor splitting and usable context length.
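To see why the cache type matters for context length, a rough estimate helps. The sketch below compares an F16 cache with a Q8_0 cache, using the ggml block layout for Q8_0 (32 int8 values plus one 2-byte scale, so 34 bytes per 32 values). The model shape is a hypothetical placeholder, not Qwen3.5 27B's actual architecture, and the calculation assumes every layer keeps a standard attention KV cache; only the ratio between the two cache types is independent of those assumptions.

```python
# Bytes per stored value. F16 uses 2 bytes; Q8_0 packs 32 int8 values
# with one 2-byte scale, i.e. 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> float:
    """Estimate total KV cache size for a standard attention model."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical shape chosen for illustration only.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16 = kv_cache_bytes(**shape, cache_type="f16")
q8 = kv_cache_bytes(**shape, cache_type="q8_0")

print(f"f16 : {f16 / 2**30:.2f} GiB")
print(f"q8_0: {q8 / 2**30:.2f} GiB")
print(f"q8_0 / f16 = {q8 / f16:.3f}")
```

Under these assumptions the Q8_0 cache takes a little over half the space of the F16 cache, which on a pair of 12GB cards translates into either more context or more headroom for model weights.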
