llada.cpp cuts diffusion LLM latency 17x–42x on smartphone NPUs
New framework exploits mobile neural processors to accelerate parallel-token denoising models, shrinking LLaDA-8B generation time by up to 42x over CPU baselines.

llada.cpp is an NPU-aware inference framework that runs diffusion large language models on smartphones up to 42 times faster than CPU execution. Diffusion LLMs denoise multiple tokens in parallel rather than generating one at a time, but the repeated denoising passes create heavy computational loads that mobile neural processing units (NPUs) struggle to handle efficiently. The framework addresses three bottlenecks: workloads that shrink as tokens commit in late-stage decoding, KV cache complications when tokens are revised, and costly data remapping between the CPU and the NPU's limited address space.
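
The shrinking-workload bottleneck can be illustrated with a toy decoding loop. This is only a sketch under stated assumptions, not llada.cpp's implementation: `toy_denoiser`, `decode_block`, the random confidences, and the fixed `commit_per_step` policy are all hypothetical stand-ins for a real model and its commit rule.

```python
import random

MASK = None  # placeholder for a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token_id, confidence) guess for every still-masked position.
    """
    return {i: (rng.randrange(1000), rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def decode_block(block_len=16, commit_per_step=4, seed=0):
    """Denoise one block, committing the most confident guesses each step."""
    rng = random.Random(seed)
    block = [MASK] * block_len
    active_per_step = []  # how many positions each pass still has to predict

    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(block, rng)
        active_per_step.append(len(guesses))
        # Assumed policy: commit the top-k masked positions by confidence.
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in best[:commit_per_step]:
            block[i] = guesses[i][0]

    return block, active_per_step


if __name__ == "__main__":
    _, active = decode_block()
    print(active)  # the per-pass workload shrinks as tokens commit
```

With 16 positions and 4 commits per pass, the later passes have only a fraction of the original positions left to predict, which is the kind of underfilled workload the article says leaves an NPU idle.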
The system introduces three techniques: multi-block speculative decoding, which fills idle NPU cycles by predicting future-block tokens while current blocks finish; dual-path progressive revision, which keeps unstable tokens refreshable on the CPU without stalling dense matrix operations on the NPU; and a swap-optimized memory runtime, which compacts address layouts and overlaps data staging with computation. Evaluations across diverse hardware and workloads show that llada.cpp reduces LLaDA-8B generation latency by 17x to 42x over CPU baselines when prefix KV cache reuse is enabled, with no degradation in output quality.
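
The idea of overlapping data staging with computation can be sketched as a simple double-buffered pipeline. This is an illustrative assumption about the general technique, not the framework's actual runtime: `stage`, `compute`, and `run_overlapped` are hypothetical names, and a background thread stands in for whatever mechanism moves data into NPU-visible memory.

```python
from concurrent.futures import ThreadPoolExecutor


def stage(chunk):
    """Hypothetical stand-in for copying a chunk into NPU-visible memory."""
    return list(chunk)


def compute(staged):
    """Hypothetical stand-in for an NPU kernel running on a staged chunk."""
    return sum(x * x for x in staged)


def run_overlapped(chunks):
    """Stage chunk i+1 in the background while chunk i is being computed."""
    results = []
    if not chunks:
        return results

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(stage, chunks[0])
        for next_chunk in chunks[1:]:
            staged = pending.result()                # wait for the current chunk
            pending = pool.submit(stage, next_chunk)  # start staging the next one
            results.append(compute(staged))          # compute while staging runs
        results.append(compute(pending.result()))    # last chunk has no successor

    return results


if __name__ == "__main__":
    print(run_overlapped([[1, 2], [3], [4, 5]]))
```

The results match a purely sequential stage-then-compute loop; the only difference is that staging time for the next chunk is hidden behind the current chunk's computation whenever the two can genuinely run concurrently.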



