llama.cpp Docker images bring multi-token prediction to five hardware backends
Community-built Docker images enable multi-token prediction across CUDA 12, CUDA 13, Vulkan, ROCm, and Intel backends while the feature awaits a mainline merge.
A community contributor released Docker images for llama.cpp that include multi-token prediction (MTP) support across five hardware backends: CUDA 12, CUDA 13, Vulkan, Intel, and ROCm. Available on Docker Hub under havenoammo/llama, the images let users with existing llama.cpp container setups swap in MTP support without rebuilding from source. The maintainer created them after repeated manual builds became unwieldy; the MTP pull request has since picked up image support and bug fixes, making local inference more stable.
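For readers with an existing container setup, the swap should amount to pulling the new image and reusing the same model mounts and server flags. A minimal sketch, assuming a hypothetical cuda12 tag and the standard llama-server entry point (neither is confirmed by the release):

    # pull the community MTP build (tag name is an assumption)
    docker pull havenoammo/llama:cuda12
    # run against an existing model directory, exposing the usual server port
    docker run --gpus all -v /path/to/models:/models -p 8080:8080 \
        havenoammo/llama:cuda12 \
        -m /models/model.gguf --host 0.0.0.0 --port 8080

Existing docker-compose files should only need the image field changed, since the volumes and server arguments carry over unchanged.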
Unsloth released official MTP-enabled GGUF weights for Qwen 3.6 (27B and 35B-A3B) this week. Benchmarks show Unsloth's lower-bit quantizations running slightly faster than the maintainer's Q8 alternatives (89.14 tokens per second on average versus 88.85, a difference of roughly 0.3 percent) with negligible accuracy loss. The Q4 quant hit 94.40 t/s at 97.39 percent MTP layer retention; the Q6 quant reached 83.22 t/s at 97.53 percent. The maintainer now recommends Unsloth's weights, citing the speed gains and a lower VRAM footprint with no meaningful quality tradeoff.
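The comparison is straightforward to rerun locally with llama.cpp's bundled llama-bench tool. A rough sketch, assuming the quantized GGUF files sit in /models under illustrative names (the actual release file names differ) and that the MTP-enabled build reports throughput the same way as mainline:

    # measure generation throughput for each quant, all layers offloaded to GPU
    llama-bench -m /models/qwen-mtp-q4.gguf -n 128 -ngl 99
    llama-bench -m /models/qwen-mtp-q8.gguf -n 128 -ngl 99

llama-bench prints average tokens per second for the generation run, which is the figure the t/s numbers above refer to.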
