llama.cpp merges multi-token prediction for faster local inference
llama.cpp merged pull request 17114 on May 15, adding MTP (Multi-Token Prediction) support to enable faster generation on models trained with multi-token objectives.

The pull request adds MTP support to the core inference engine. MTP is a technique that predicts multiple tokens per forward pass instead of one, which can speed up generation for models trained with multi-token objectives. The feature is now available to anyone building llama.cpp from source.
The implementation adds integration points across the existing GGML backend, and it means llama.cpp can now run models trained to output multiple tokens per step, a capability that was previously missing from the GGML inference stack.
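
To make the mechanism concrete, here is a minimal, self-contained sketch of a multi-token decode loop. This is not the API introduced by the pull request; `predict_k_tokens` is a hypothetical stand-in for a single forward pass of a model with extra prediction heads:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one model forward pass. A model trained with a
// multi-token objective returns up to k tokens per call instead of one.
// Real MTP models predict future positions with extra heads; here we just
// fabricate deterministic token ids for illustration.
static std::vector<int> predict_k_tokens(const std::vector<int>& context, int k) {
    std::vector<int> out;
    int base = context.empty() ? 0 : context.back();
    for (int i = 1; i <= k; ++i) {
        out.push_back(base + i);  // placeholder "predictions"
    }
    return out;
}

int main() {
    const int k        = 4;   // tokens predicted per forward pass
    const int n_target = 16;  // total tokens to generate

    std::vector<int> tokens = {1};  // prompt (a single token here)
    int n_passes = 0;

    while ((int)tokens.size() - 1 < n_target) {
        std::vector<int> next = predict_k_tokens(tokens, k);
        ++n_passes;
        for (int t : next) {
            tokens.push_back(t);
            if ((int)tokens.size() - 1 >= n_target) break;
        }
    }

    // With k=4, generating 16 tokens takes 4 passes instead of 16.
    std::printf("generated %d tokens in %d forward passes\n",
                (int)tokens.size() - 1, n_passes);
    return 0;
}
```

Depending on the design, an implementation may verify the extra predictions before accepting them; the loop above skips that step to keep the pass-count effect visible.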
What stands out
- Inference speed potential — MTP reduces the number of forward passes needed for a given output length, which may translate to faster generation where single-token decoding is memory-bandwidth-bound (a pass-count bound is sketched after this list).
- Model compatibility — This opens llama.cpp to a new class of models. Any checkpoint trained with multi-token prediction objectives can now run in the GGML ecosystem without conversion workarounds.
- No new dependencies — The pull request adds MTP logic to the existing GGML backend without requiring external libraries, keeping the build process unchanged.
- Immediate availability — Developers building from the main branch today have MTP support out of the box.
- Community-driven — The feature was requested in prior issues. The merge reflects ongoing work to keep llama.cpp competitive with newer inference techniques as they emerge.
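
As a rough upper bound on the speed item above: generating n tokens at k tokens per pass takes ceil(n/k) forward passes. A tiny illustration, assuming every predicted token is accepted (real gains shrink when predictions must be verified or rejected):

```cpp
#include <cstdio>

// ceil(n/k): forward passes needed to emit n tokens at k tokens per pass.
static int passes_needed(int n_tokens, int k) {
    return (n_tokens + k - 1) / k;
}

int main() {
    const int n = 256;  // tokens to generate
    for (int k : {1, 2, 4, 8}) {
        int p = passes_needed(n, k);
        std::printf("k=%d: %3d passes (%.1fx fewer than one-token decoding)\n",
                    k, p, (double)n / p);
    }
    return 0;
}
```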