Qwen 3.6 27B hits 34 tokens/s on M5 Max with multi-token prediction and TurboQuant
A patched llama.cpp build with multi-token prediction and TurboQuant quantization delivers roughly 60% faster inference on Apple Silicon, pushing Qwen 3.6 27B from 21 to 34 tokens per second on a MacBook Pro M5 Max.
A patched build of llama.cpp now runs Qwen 3.6 27B at 34 tokens per second on a MacBook Pro M5 Max with 64GB RAM, up from 21 tokens per second with TurboQuant alone. The roughly 60% speed gain comes from multi-token prediction (MTP), a speculative-decoding technique that drafts several tokens ahead and achieves a 90% acceptance rate on Qwen models. The fork combines MTP with TurboQuant, a KV-cache quantization method that shrinks the cache's memory footprint during inference. Quantized GGUF weights for Qwen 3.6 27B and 35B with MTP support are available on HuggingFace, and the patched llama.cpp source lives on GitHub.
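The speedup hinges on that acceptance rate: drafted tokens are only committed if the main model would have produced them anyway, so a cheap verification pass can emit several tokens at once. Here is a minimal toy sketch of that verify-and-commit step and of the expected tokens committed per pass; the function names, the draft depth of 2, and the independence assumption on per-token acceptance are illustrative, not details of the actual llama.cpp patch.

```python
def verify_drafts(draft_tokens, target_tokens):
    # Accept the longest prefix of the MTP drafts that agrees with the
    # target model's greedy picks; on the first mismatch, commit the
    # target model's token instead and stop. target_tokens carries one
    # extra entry so a fully accepted draft still yields a bonus token.
    committed = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            committed.append(t)
            return committed
        committed.append(d)
    committed.append(target_tokens[len(draft_tokens)])
    return committed


def expected_tokens_per_pass(k, a):
    # Expected tokens committed per verification pass when each of k
    # drafts matches independently with probability a:
    # 1 + a + a^2 + ... + a^k.
    return sum(a ** i for i in range(k + 1))


print(verify_drafts([5, 7, 9], [5, 7, 2, 4]))        # [5, 7, 2]
print(round(expected_tokens_per_pass(2, 0.9), 2))    # 2.71
```

With the ~90% acceptance rate reported for Qwen models, even a shallow draft of two tokens nearly triples the tokens committed per main-model forward pass, which is where the bulk of the wall-clock gain comes from.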
