Sumi 7B uniform diffusion model trains from scratch on 1.5 trillion tokens
Researchers released Sumi, a 7B-parameter uniform diffusion language model trained from scratch on 1.5 trillion tokens, filling a gap in open research on non-autoregressive architectures.

Sumi is a 7-billion-parameter uniform diffusion language model trained from scratch on 1.5 trillion tokens. The model, whose name means "ink" in Japanese, is the first uniform diffusion language model (UDLM) pretrained at both large parameter scale and large token budget, according to a preprint released this week.
Uniform diffusion language models allow any token in a sequence to be updated at any step during generation, which makes generation more flexible than in autoregressive models that produce tokens strictly left to right. Autoregressive and masked diffusion models already have capable open implementations at scale; uniform diffusion had none until this release.
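To make the distinction concrete, here is a toy sketch of the noising step as uniform discrete diffusion is commonly defined, next to its masked counterpart. It is not taken from the Sumi preprint: the function names, the `keep_prob` parameter, the vocabulary size, and the `-1` mask id are all illustrative assumptions. The point it shows is that uniform corruption leaves no marker on noisy positions, so a denoiser cannot treat any token as settled and may revise any of them at any step.

```python
import random

def uniform_corrupt(tokens, keep_prob, vocab_size, rng):
    """Toy forward (noising) step for uniform discrete diffusion.

    Each position independently keeps its token with probability `keep_prob`;
    otherwise it is resampled uniformly from the whole vocabulary (and may,
    by chance, land on the original token again).
    """
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in tokens
    ]

def masked_corrupt(tokens, keep_prob, mask_id, rng):
    """Masked-diffusion counterpart: corrupted positions become `mask_id`."""
    return [tok if rng.random() < keep_prob else mask_id for tok in tokens]

# Hypothetical token ids and settings, chosen only for illustration.
rng = random.Random(0)
clean = [5, 17, 42, 8, 23, 4]
noisy_uniform = uniform_corrupt(clean, keep_prob=0.5, vocab_size=50, rng=rng)
noisy_masked = masked_corrupt(clean, keep_prob=0.5, mask_id=-1, rng=rng)

print("uniform:", noisy_uniform)  # every entry still looks like a real token
print("masked: ", noisy_masked)   # corrupted entries are visibly marked as -1
```

In the masked output the corrupted positions are identifiable by inspection, while in the uniform output they are indistinguishable from clean tokens. A real model would learn to reverse this corruption over many steps, which the sketch does not attempt.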
What stands out
- 7B parameters, 1.5T tokens. Sumi was trained from scratch on a token budget comparable to recent open autoregressive models, providing a clean reference point for studying scaling behavior in uniform diffusion.
- Competitive on knowledge, reasoning, and coding. The model performs on par with autoregressive models trained at similar token budgets on knowledge-intensive, reasoning, and coding benchmarks.
- Weaker on commonsense tasks. Sumi under-performs on commonsense benchmarks, which the authors attribute to an education-heavy data mixture that may not reflect everyday reasoning patterns.
- Fully open release. The team released model weights, training checkpoints, and the complete training recipe, including a detailed specification of the data mixture over publicly available corpora.
- Research catalyst. The release is intended to enable the community to study native uniform diffusion at scale and investigate aspects of the architecture that remain poorly understood, including generation dynamics, controllability, and trade-offs against established paradigms.



