Qwen3.6 27B hits 50 tok/s on 7900XTX with Q8 cache quantization

A LocalLLaMA user achieved 50 tokens per second on AMD's 7900XTX by quantizing the KV cache to Q8, fitting Qwen3.6 27B in 24GB VRAM at 64k context—a 2.2× speedup over the unquantized baseline.

May 16, 2026

Qwen3.6 27B hits 50 tok/s on 7900XTX with Q8 cache quantization

A user running Qwen3.6 27B on an AMD Radeon 7900XTX reported a 2.2× speed boost after switching from an unquantized cache to Q8 cache quantization. The setup uses llama.cpp's MTP (multi-token prediction) speculative draft feature, which generates up to three draft tokens per step to accelerate inference.

The baseline config—Q4_K_M weights, 64k context, no cache quantization—maxed out VRAM at 93 percent and delivered 22.7 tokens per second. Switching to Q8 cache dropped VRAM usage enough to fit the entire model and context window in the card's 24GB, lifting throughput to 50 tok/s. The user had been running Qwen3.6 35B A3B, a mixture-of-experts checkpoint, at 128k context before the test.

What stands out

01Q8 cache cuts VRAM pressure. The Q4_K_M weights alone consumed most of the 24GB at 64k context; quantizing the cache freed enough headroom to keep the model resident and doubled speed.
02MTP draft adds overhead at low VRAM. The --spec-type draft-mtp --spec-draft-n-max 3 flags enable speculative decoding, but the feature's benefit depends on having spare VRAM for the draft model—cache quantization made that possible.
03Vulkan backend on AMD. The build uses Vulkan rather than ROCm, a common choice for consumer Radeon cards where ROCm driver support lags.
0450 tok/s is usable for chat. The final speed matches or exceeds typical typing rates, making the 27B dense model competitive with the 35B MoE for interactive use.
05Context size trades off. The user noted the MoE ran at 128k context; the dense 27B at Q4 + Q8 cache holds 64k, half the prior window but still well above most chat session lengths.

What stands out

More in Community