llama.cpp MTP draft mode cuts Qwen3.6 latency 20–30% on RTX 5090

A local tester reports that llama.cpp's new multi-token prediction draft mode shaves 20–30 percent off Qwen3.6 generation times on NVIDIA's RTX 5090, using the same GGUF quantization for both runs.

May 16, 2026

A tester running llama.cpp on an RTX 5090 this week found that the library's new multi-token prediction (MTP) draft mode cuts Qwen3.6 generation latency by 20 to 30 percent compared with standard autoregressive decoding. To isolate MTP from quantization effects, each model was loaded from the same GGUF file for both runs (Unsloth's Qwen3.6-27B-MTP-GGUF at Q5_K_M and Qwen3.6-35B-A3B-MTP-GGUF at UD-Q4_K_M), and only the --spec-type draft-mtp --spec-draft-n-max 3 flags were toggled.

The setup used llama.cpp commit 4f13cb7 with a 128k context, flash attention, and a q8_0 KV cache on a 32 GB RTX 5090 running Linux. Two prompts were tested, a 400-token short story and a 3,000-token Flappy Bird HTML file, and the results for each were averaged over three seeds. The tester noted that MTP requires --parallel 1, which disables concurrent batch processing. The official llama.cpp Docker image had not yet picked up the MTP merge at the time of testing, so a manual build with CUDA_DOCKER_ARCH=120 was needed for Blackwell support.
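For readers who want to reproduce the comparison, the following is a minimal sketch of how the two runs could be launched so that only the MTP flags differ. It rests on several assumptions not confirmed by the tester's report: that the runs used the llama-server binary, that "128k" means 131072 tokens, that flash attention is enabled with the value form "--flash-attn on", that --parallel 1 was kept for the baseline as well, and that the model path (hypothetical here) points at a local GGUF file. Only the --spec-type, --spec-draft-n-max, and --parallel values come from the report itself.

```python
import shlex
from typing import List

# Flags quoted in the report for enabling MTP drafting.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3"]


def build_server_args(model_path: str, use_mtp: bool) -> List[str]:
    """Build an argv list for one benchmark run.

    Everything except MTP_FLAGS is identical between the two runs, so any
    latency difference can be attributed to MTP drafting alone.
    """
    args = [
        "llama-server",              # assumption: the server binary was used
        "--model", model_path,       # hypothetical local path to the GGUF file
        "--ctx-size", "131072",      # assumption: 128k means 131072 tokens
        "--flash-attn", "on",        # assumption: value form of the flag
        "--cache-type-k", "q8_0",    # q8_0 KV cache (keys)
        "--cache-type-v", "q8_0",    # q8_0 KV cache (values)
        "--parallel", "1",           # required by MTP; kept for the baseline too
    ]
    if use_mtp:
        args += MTP_FLAGS
    return args


if __name__ == "__main__":
    model = "models/qwen3.6-27b-mtp-q5_k_m.gguf"  # hypothetical file name
    for use_mtp in (False, True):
        print(shlex.join(build_server_args(model, use_mtp)))
```

Keeping the shared arguments in one place makes it harder to change a setting in one run and forget it in the other, which is the main risk in an A/B test like this.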

MTP works by drafting multiple candidate tokens in parallel, then verifying them in a single forward pass. When the draft hits, the model accepts several tokens at once; when it misses, it falls back to the standard single-token step. The 20–30 percent latency reduction suggests the draft hit rate is high enough to offset the extra compute spent on drafting and verification on this hardware.
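The draft-and-verify cycle described above can be illustrated with a toy sketch of greedy speculative decoding. This is not llama.cpp's implementation: the two "models" are hypothetical stand-in functions over integer tokens, the draft is a separate function rather than prediction heads inside the main model, and verification is written as a loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

# A toy "model": maps a token sequence to the next token (greedy decoding).
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], n_draft: int = 3) -> List[int]:
    """Run one draft-and-verify step and return the tokens to append."""
    # 1. Draft: cheaply propose n_draft tokens.
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft(context + proposed))

    # 2. Verify: compare each proposal with what the target would emit.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # Miss: keep the target's own token, so the step still
            # produces one token, as in ordinary decoding.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched: the verification also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_new: int, n_draft: int = 3) -> Tuple[List[int], int]:
    """Generate n_new tokens; return the sequence and the number of steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, n_draft))
        steps += 1
    return out[:len(prompt) + n_new], steps


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except immediately after token 4, where it guesses wrong.
def target_model(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = generate(target_model, draft_model, [0], n_new=8)
    print(tokens, steps)
```

In this toy run, eight new tokens take three verification steps instead of eight single-token steps, and the output is identical to what the target alone would produce. Whether that translates into lower wall-clock latency depends on how often drafts are accepted and how much the drafting and wider verification pass cost.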
