Qwen 3.7-27B MTP draft cache quantization frees VRAM with no speed loss
A llama.cpp user reports that quantizing the MTP layer's KV cache to Q8_0 preserves draft acceptance rates while freeing VRAM for longer context windows on Qwen 3.5/3.6 models.

Qwen 3.5 and 3.6 models running in llama.cpp with the MTP (multi-token prediction) layer need extra VRAM for that layer's dedicated KV cache, but the cache can be quantized just like the main model's. Testing the --cache-type-k-draft q8_0 and --cache-type-v-draft q8_0 flags on Qwen3.7-27B-Q8_0 showed no loss in draft acceptance rate or wall-clock speed.
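
To give a rough sense of where the savings come from, the sketch below compares an f16 cache with a Q8_0 cache. The Q8_0 block layout (32 values stored in 34 bytes) is the standard ggml format. The layer dimensions and context length are hypothetical placeholders, not values from the Qwen model card or the user's report, and the formula assumes a plain attention KV cache.

```python
# Bytes per cached element for two llama.cpp cache types.
# q8_0 packs 32 values into a 34-byte block (one fp16 scale + 32 int8 values).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate the combined K and V cache size in bytes."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical dimensions for a single MTP layer (assumption, for illustration only).
mtp = dict(n_layers=1, n_kv_heads=8, head_dim=128, n_ctx=131072)

f16_size = kv_cache_bytes(cache_type="f16", **mtp)
q8_size = kv_cache_bytes(cache_type="q8_0", **mtp)
saved_fraction = 1 - q8_size / f16_size

print(f"f16:   {f16_size / 2**20:.0f} MiB")
print(f"q8_0:  {q8_size / 2**20:.0f} MiB")
print(f"saved: {saved_fraction:.1%}")
```

Whatever the real dimensions are, the ratio is fixed by the block format: a Q8_0 cache takes 34/64 of the f16 size, a saving of just under 47 percent. The absolute number of bytes freed depends on the actual layer shape and context length.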
Benchmarks on dual AMD MI50 cards (32GB each, PCIe 4.0 x8) showed a 73.5 percent draft acceptance rate across nine requests whether the MTP cache ran at full precision or Q8_0. Wall time held steady at 49.3–49.5 seconds in single-GPU mode and 38.3–38.4 seconds under tensor parallelism. The quantized cache freed several gigabytes of VRAM without changing the number of accepted draft tokens.
The MTP layer generates up to three speculative tokens per forward pass; the main model then accepts or rejects them in parallel. Across 1,404 predicted tokens, 957 (about 68 percent) came from accepted drafts, and that figure stayed the same after quantization. The result suggests memory-constrained setups could extend context windows without sacrificing inference speed.
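
The 73.5 percent acceptance rate and the 957-of-1,404 figure measure different things, which a few lines of arithmetic make clearer. The sketch below uses only the numbers quoted above. The drafted-token count is inferred on the assumption that acceptance rate means accepted tokens divided by drafted tokens; it is not a number from the report.

```python
# Figures quoted in the article.
predicted_tokens = 1404
accepted_draft_tokens = 957
reported_acceptance_rate = 0.735

# Share of all output tokens that originated as accepted drafts.
draft_share = accepted_draft_tokens / predicted_tokens

# Assumption: acceptance rate = accepted / drafted, so this count is implied, not reported.
implied_drafted_tokens = round(accepted_draft_tokens / reported_acceptance_rate)

# Wall-time ranges in seconds, spanning the full-precision and Q8_0 runs.
wall_time = {"single_gpu": (49.3, 49.5), "tensor_parallel": (38.3, 38.4)}
spread = {mode: (hi - lo) / lo for mode, (lo, hi) in wall_time.items()}

print(f"draft share of output: {draft_share:.1%}")
print(f"implied drafted tokens: {implied_drafted_tokens}")
for mode, s in spread.items():
    print(f"{mode} wall-time spread: {s:.2%}")
```

Under that assumption, roughly 1,302 tokens were drafted to yield 957 accepted ones, and the wall-time ranges differ by less than half a percent in both modes.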