Multi-Block Diffusion Language Models nearly triple token throughput with concurrent decoding
MBD-LMs post-train block diffusion models with Multi-block Teacher Forcing, raising tokens per forward pass from 3.47 to 9.34 when combined with DMax, while maintaining accuracy on math and code benchmarks.

Multi-Block Diffusion Language Models (MBD-LMs) extend single-block diffusion text generation to concurrent multi-block decoding by addressing a training-inference mismatch in existing Block Diffusion Language Models. Those models train under teacher forcing on one noisy block conditioned on a clean prefix, but at inference they decode a bounded running set of blocks with heterogeneous noise patterns. MBD-LMs close that gap with Multi-block Teacher Forcing (MultiTF), which trains on bounded noise groups conditioned on clean prefixes and uses randomized noise schedulers that better match real decoding states. An optimized Block Buffer decoding algorithm preserves prefix-cache reuse, keeps input shapes static, and translates the increased parallelism into measurable wall-clock speedup.
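
To make the training idea concrete, here is a minimal sketch of how one training example with a clean prefix and a bounded group of noisy blocks might be built. It is an illustration, not the authors' implementation: the function name `sample_multiblock_noise`, the `MASK` token id, and the choice of an independent uniform mask ratio per block are all assumptions standing in for the randomized noise schedulers described above.

```python
import random

MASK = -1  # hypothetical mask token id, chosen only for this sketch


def sample_multiblock_noise(tokens, block_size, max_noisy_blocks, rng):
    """Build one example: a clean prefix followed by a bounded noisy group.

    Returns (noisy_tokens, prefix_blocks, ratios), where ratios holds the
    mask ratio drawn for each block in the noisy group.
    """
    num_blocks = len(tokens) // block_size
    # The noisy group is bounded, mirroring the bounded running set at inference.
    group = rng.randint(1, min(max_noisy_blocks, num_blocks))
    # Every block before the group stays clean and acts as the prefix.
    prefix_blocks = rng.randint(0, num_blocks - group)
    end = (prefix_blocks + group) * block_size
    noisy = list(tokens[:end])
    ratios = []
    for b in range(prefix_blocks, prefix_blocks + group):
        # Assumption: each block gets its own mask ratio, so the group
        # shows heterogeneous noise levels across blocks.
        ratio = rng.uniform(0.0, 1.0)
        ratios.append(ratio)
        for i in range(b * block_size, (b + 1) * block_size):
            if rng.random() < ratio:
                noisy[i] = MASK
    return noisy, prefix_blocks, ratios


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 132))  # 8 blocks of 4 toy token ids
    noisy, prefix_blocks, ratios = sample_multiblock_noise(tokens, 4, 3, rng)
    print(prefix_blocks, len(ratios), noisy)
```

Single-block teacher forcing corresponds to calling this with `max_noisy_blocks=1`; larger values expose the model to several partially decoded blocks at once, which is the situation it meets during concurrent decoding.
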
On the MBD-LLaDA2-Mini checkpoint, average Tokens Per Forward pass (TPF) climbs from 3.47 to 6.19 and average accuracy rises from 79.95% to 81.03%. Combined with DMax, a separate decoding strategy, MBD-LLaDA2-Mini-DMax reaches 9.34 TPF with only a 1.02% accuracy drop on math and code benchmarks. The preprint was posted to arXiv on July 1, 2026.
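
The throughput claims can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative. Note that these are ratios of tokens per forward pass, not wall-clock speedups, which also depend on the cost of each forward pass.

```python
# Figures quoted in the text.
BASELINE_TPF = 3.47
MBD_TPF = 6.19
MBD_DMAX_TPF = 9.34
BASELINE_ACC = 79.95
MBD_ACC = 81.03

# Ratio of tokens per forward pass relative to the baseline.
mbd_gain = MBD_TPF / BASELINE_TPF
dmax_gain = MBD_DMAX_TPF / BASELINE_TPF

# Accuracy change in percentage points for the MultiTF-only checkpoint.
acc_delta = MBD_ACC - BASELINE_ACC

print(f"MBD-LLaDA2-Mini:      {mbd_gain:.2f}x TPF")        # 1.78x
print(f"MBD-LLaDA2-Mini-DMax: {dmax_gain:.2f}x TPF")       # 2.69x
print(f"Accuracy change:      {acc_delta:+.2f} points")    # +1.08
```

The 2.69x figure is what the headline rounds to "nearly triple".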




