DiffusionGemma hits 1,000 tokens/sec with iterative text refinement
Google released DiffusionGemma, a 26-billion-parameter model with 4 billion active parameters that uses diffusion-style iterative refinement to generate text at 1,000 tokens per second on an H100 and 700 on an RTX 5090.
Google this week released DiffusionGemma, a 26-billion-parameter language model with 4 billion active parameters that uses diffusion-style generation to hit 1,000 tokens per second on an H100 GPU. The model generates 256 tokens at once, then refines them over multiple passes, the same iterative approach image diffusion models use to sharpen outputs step by step. On an RTX 5090, DiffusionGemma reaches 700 tokens per second.
DiffusionGemma is built on the Gemma 4 architecture and runs in FP8 precision. Google's benchmark shows it outpaces Gemma 4 with multi-token prediction (MTP) by more than 3× on the same hardware—1,000 tokens per second versus 303. The model is weaker than the standard Gemma 4 on quality benchmarks, which Google frames as expected for a preview of diffusion-based text generation. The model also supports chain-of-thought reasoning.
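The quoted throughput figures can be turned into a speedup ratio and a rough per-block latency. This is a back-of-the-envelope sketch only: it assumes the quoted rates are sustained end to end, and the helper names `speedup` and `block_latency_ms` are illustrative, not from Google's benchmark.

```python
# Figures quoted in the article (tokens per second).
H100_DIFFUSION_TPS = 1000
H100_GEMMA4_MTP_TPS = 303
RTX5090_DIFFUSION_TPS = 700
BLOCK_SIZE = 256  # tokens generated per block, per the article


def speedup(fast_tps, slow_tps):
    """Ratio of two throughputs."""
    return fast_tps / slow_tps


def block_latency_ms(tps, block_size=BLOCK_SIZE):
    """Milliseconds to finish one block, assuming the quoted rate is sustained."""
    return 1000 * block_size / tps


if __name__ == "__main__":
    print(f"speedup vs Gemma 4 + MTP: {speedup(H100_DIFFUSION_TPS, H100_GEMMA4_MTP_TPS):.2f}x")
    print(f"H100 block latency: {block_latency_ms(H100_DIFFUSION_TPS):.0f} ms")
    print(f"RTX 5090 block latency: {block_latency_ms(RTX5090_DIFFUSION_TPS):.0f} ms")
```

Under that assumption the ratio comes out at about 3.3×, and a full 256-token block takes roughly a quarter of a second on an H100, including every refinement pass.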
What stands out
- 01 Speed over quality trade-off. DiffusionGemma generates tokens more than three times faster than Gemma 4 with MTP (1,000 versus 303 tokens per second on the same hardware), but quality drops. Google positions this as a research preview: diffusion training for language models is still being refined.
- 02 Iterative token refinement. The model generates 256 tokens in parallel, then regenerates them multiple times to reduce noise. This mirrors how image diffusion models denoise latents, but applied to discrete tokens.
- 03 Broad inference support. Weights are available on Hugging Face. vLLM, Unsloth, and other inference frameworks already support the model. Google also hosts a free code-generation demo that shows real-time token refinement.
- 04 Consumer GPU performance. 700 tokens per second on an RTX 5090 puts high-speed inference within reach of prosumer hardware. Only 4 billion of the 26 billion parameters are active per token, which keeps per-token compute low; the full 26 billion weights still have to be loaded, roughly 26 GB at one byte per parameter in FP8.
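The parallel generate-then-refine loop described above can be illustrated with a toy. The article does not describe DiffusionGemma's actual sampler, so everything here is an assumption: the mask placeholder, the confidence-ranked commit schedule, the `NUM_STEPS` value, and the stand-in `toy_denoiser` (which peeks at a fixed target instead of running a model) are hypothetical and only show the general shape of iterative refinement over discrete tokens.

```python
import random

BLOCK_SIZE = 256   # tokens generated together, per the article
NUM_STEPS = 8      # hypothetical number of refinement passes
MASK = None        # placeholder for a position that is not yet committed


def toy_denoiser(block, target, rng):
    """Stand-in for the model: propose a (token, confidence) pair per position.

    Hypothetical: a real model predicts from context. This toy peeks at a fixed
    target and is right more often as more of the block is filled in.
    """
    filled = sum(tok is not MASK for tok in block) / len(block)
    proposals = []
    for i in range(len(block)):
        correct = rng.random() < 0.5 + 0.5 * filled
        token = target[i] if correct else rng.randrange(1000)
        confidence = rng.random() * (1.0 if correct else 0.5)
        proposals.append((token, confidence))
    return proposals


def refine_block(target, num_steps=NUM_STEPS, seed=0):
    """Start fully masked; each pass re-proposes every position in parallel
    and commits a growing share of the most confident proposals."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = []  # committed positions after each pass
    for step in range(1, num_steps + 1):
        proposals = toy_denoiser(block, target, rng)
        keep = len(target) * step // num_steps
        ranked = sorted(range(len(target)),
                        key=lambda i: proposals[i][1], reverse=True)
        block = [MASK] * len(target)  # everything else is masked again
        for i in ranked[:keep]:
            block[i] = proposals[i][0]
        history.append(sum(tok is not MASK for tok in block))
    return block, history


if __name__ == "__main__":
    block, history = refine_block(list(range(BLOCK_SIZE)))
    print(history)
```

With 256 positions and 8 passes, the committed count grows by 32 per pass until the whole block is filled. Each pass touches all positions at once, which is where the throughput advantage over one-token-at-a-time decoding would come from.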







