TBQ4 KV cache quantization hits AMD ROCm RDNA3 at 38–54 tok/s
A community fork of llama.cpp brings TBQ4 KV cache quantization and MTP speculative decoding to AMD's RX 7900 XTX, fitting 64k context in 24 GB VRAM at 38–54 tok/s.

A community developer has ported TurboQuant's TBQ4 KV cache quantization and MTP speculative decoding to AMD ROCm for RDNA3 GPUs. The experimental branch, tbq4-rdna3-experiment, targets the RX 7900 XTX and other gfx1100 hardware, addressing gaps in AMD's existing llama.cpp paths. The main achievement: running Qwen3.6-27B Q4_K_M with 64k context in under 20 GB VRAM while maintaining 38–54 tok/s generation speed.
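A rough sense of why a 4-bit KV cache is what makes 64k context fit: at long context, the K/V tensors rival the weights for VRAM. The sketch below estimates KV cache size per bit width for a ~27B GQA model. The layer, head, and dimension counts are assumed placeholders, not published Qwen3.6-27B specs, and the ~8.5 and ~4.5 bits-per-element figures approximate llama.cpp-style block formats, which store a per-block scale alongside the quantized values.

```python
# Back-of-envelope KV cache sizing. The model-shape numbers below are
# ASSUMED placeholders for a ~27B GQA model, not published specs.
n_layers = 48        # assumed
n_kv_heads = 8       # assumed (grouped-query attention)
head_dim = 128       # assumed
ctx = 64 * 1024      # 64k context

def kv_cache_gib(bits_per_elt: float) -> float:
    """Size of the K+V cache in GiB for one sequence at full context."""
    elts = 2 * n_layers * n_kv_heads * head_dim * ctx   # K and V
    return elts * (bits_per_elt / 8) / 2**30

# ~8.5 and ~4.5 bits/elt approximate block formats with per-block scales.
for name, bits in [("f16", 16), ("q8_0-like", 8.5), ("4-bit-like", 4.5)]:
    print(f"{name:>10}: {kv_cache_gib(bits):6.3f} GiB")
```

Under these assumed shapes, halving the KV cache from ~8.5 to ~4.5 bits per element saves about 3 GiB at 64k context, which is the same order as the savings the benchmarks above report.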
The fork integrates TBQ4 dequantization directly into ROCm's VEC Flash Attention kernel. Tests used a 24 GB RX 7900 XTX running Qwen3.6-27B Q4_K_M weights with tbq4_0 KV cache and MTP speculative decoding (--spec-draft-n-max 3). Prefill speed hit 537.7 tok/s at 16k context and 360.8 tok/s at 64k. For comparison, the baseline q8_0 KV cache delivered 49.8 tok/s at 16k and 31 tok/s at 32k, consuming 22–23 GB VRAM.
What stands out
- VRAM savings: TBQ4 KV cache at 64k context uses ~20 GB, roughly 2–3 GB less than q8_0 at 32k context.
- Speed holds up: Generation throughput ranges from 38 to 54 tok/s at 64k context with TBQ4, compared to 31 tok/s at 32k with q8_0.
- Prefill remains practical: 360.8 tok/s prefill at 64k context is a 33% drop from the 16k baseline, but still usable for long-context workflows.
- RDNA3.5 and RDNA4 enabled but unverified: The branch includes code paths for newer AMD architectures; no test results yet.
- Other quant schemes present: RotorQuant, PlanarQuant, and IsoQuant are in the codebase but not validated on this hardware.
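The internals of these schemes aren't documented in the branch notes, but 4-bit KV cache formats in llama.cpp generally share a blockwise layout: a group of values stores one shared scale plus a 4-bit integer per value. The sketch below is a generic quantize/dequantize round trip in the spirit of llama.cpp's q4_0, not TBQ4's actual transform:

```python
# Generic 4-bit blockwise quantization sketch (NOT TBQ4's actual scheme):
# each block of 32 values shares one float scale, and each value is stored
# as a 4-bit integer in [-8, 7].

BLOCK = 32

def quantize_q4(xs):
    """Return a list of (scale, quants) blocks; quants are ints in [-8, 7]."""
    blocks = []
    for i in range(0, len(xs), BLOCK):
        block = xs[i:i + BLOCK]
        amax = max(abs(v) for v in block) or 1.0  # avoid divide-by-zero
        scale = amax / 7.0                        # map the peak to level 7
        q = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks

def dequantize_q4(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * v for scale, q in blocks for v in q]

vals = [0.05 * i for i in range(32)]
round_trip = dequantize_q4(quantize_q4(vals))
max_err = max(abs(a - b) for a, b in zip(vals, round_trip))
print(f"max round-trip error: {max_err:.4f}")
```

The per-block scale bounds the worst-case error at half a quantization step, which is why 4-bit KV caches stay usable: attention scores degrade gracefully as long as each block's dynamic range is modest.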