llama.cpp Docker images bring multi-token prediction to five hardware backends
Community-built Docker images enable multi-token prediction across CUDA 12, CUDA 13, Vulkan, ROCm, and Intel backends while the feature awaits a mainline merge.
A community contributor released Docker images for llama.cpp that include multi-token prediction (MTP) support across five hardware backends: CUDA 12, CUDA 13, Vulkan, Intel, and ROCm. Available on Docker Hub under havenoammo/llama, the images let users with existing llama.cpp container setups swap in MTP support without rebuilding from source. The maintainer created them after repeated manual builds became unwieldy; the MTP pull request has since picked up image support and bug fixes, making local inference more stable.
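For readers with an existing container setup, the swap should amount to pulling the new image and reusing the same model mounts and server flags. A minimal sketch, assuming a hypothetical cuda12 tag and the standard llama-server entry point (neither is confirmed by the release):

    # pull the community MTP build (tag name is an assumption)
    docker pull havenoammo/llama:cuda12
    # run against an existing model directory, exposing the usual server port
    docker run --gpus all -v /path/to/models:/models -p 8080:8080 \
        havenoammo/llama:cuda12 \
        -m /models/model.gguf --host 0.0.0.0 --port 8080

Existing docker-compose files should only need the image field changed, since the volumes and server arguments carry over unchanged.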
Unsloth released official MTP-enabled GGUF weights for Qwen 3.6 (27B and 35B-A3B) this week. Benchmarks show Unsloth's lower-bit quantizations running slightly faster than the maintainer's Q8 alternatives (89.14 tokens per second on average versus 88.85, a difference of roughly 0.3 percent) with negligible accuracy loss. The Q4 quant hit 94.40 t/s at 97.39 percent MTP layer retention; the Q6 quant reached 83.22 t/s at 97.53 percent. The maintainer now recommends Unsloth's weights, citing the speed gains and a lower VRAM footprint with no meaningful quality tradeoff.
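The comparison is straightforward to rerun locally with llama.cpp's bundled llama-bench tool. A rough sketch, assuming the quantized GGUF files sit in /models under illustrative names (the actual release file names differ) and that the MTP-enabled build reports throughput the same way as mainline:

    # measure generation throughput for each quant, all layers offloaded to GPU
    llama-bench -m /models/qwen-mtp-q4.gguf -n 128 -ngl 99
    llama-bench -m /models/qwen-mtp-q8.gguf -n 128 -ngl 99

llama-bench prints average tokens per second for the generation run, which is the figure the t/s numbers above refer to.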
