DiffusionGemma hits 1,000 tokens/sec with iterative text refinement

Google released DiffusionGemma, a 26-billion-parameter model with 4 billion active parameters that uses diffusion-style iterative refinement to generate text at 1,000 tokens per second on H100 and 700 on RTX 5090.

By Alex Sokoloff · June 12, 2026

This week Google released DiffusionGemma, a 26-billion-parameter language model with 4 billion active parameters that uses diffusion-style generation to hit 1,000 tokens per second on an H100 GPU. The model generates 256 tokens at once, then refines them over multiple passes, the same iterative approach image diffusion models use to sharpen outputs step by step. On an RTX 5090, DiffusionGemma reaches 700 tokens per second.
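The generate-then-refine loop can be illustrated with a toy sketch. The article does not describe DiffusionGemma's actual refinement schedule, so this assumes confidence-based unmasking, one common scheme in masked diffusion language models. The `toy_denoiser` function is a hypothetical stand-in for a model forward pass, and the block is 12 characters rather than 256 tokens.

```python
import random

MASK = "_"  # placeholder for a not-yet-committed position


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (proposed_token, confidence) pair for every position.
    A real model would predict these from context; this toy version
    reads a fixed target string and assigns random confidences.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]


def refine_block(target, num_steps=4, seed=0):
    """Fill a fully masked block over num_steps parallel passes."""
    rng = random.Random(seed)
    n = len(target)
    tokens = [MASK] * n
    history = ["".join(tokens)]
    for step in range(num_steps):
        # One pass scores every position in the block at once.
        proposals = toy_denoiser(tokens, target, rng)
        masked = [i for i in range(n) if tokens[i] == MASK]
        # Commit an equal share of the remaining masked positions,
        # most confident first; the final pass commits what is left.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = proposals[i][0]
        history.append("".join(tokens))
    return history


if __name__ == "__main__":
    for state in refine_block("hello world!", num_steps=4):
        print(state)
```

Each printed line is the block after one pass. The speed advantage of this style of generation comes from the number of passes being much smaller than the number of tokens in the block, whereas an autoregressive model needs one forward pass per token.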

DiffusionGemma is built on the Gemma 4 architecture and runs in FP8 precision. Google's benchmark shows it outpacing Gemma 4 with multi-token prediction (MTP) by more than 3× on the same hardware: 1,000 tokens per second versus 303. The model scores lower than standard Gemma 4 on quality benchmarks, which Google frames as expected for a preview of diffusion-based text generation. The model also supports chain-of-thought reasoning.
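The quoted throughput figures translate into speedup and wall-clock time as follows. This is plain arithmetic on the article's numbers. The time-per-block values assume the quoted rates are sustained, which the article does not state.

```python
# Throughput figures quoted in the article (tokens per second).
DIFFUSION_H100 = 1000
GEMMA4_MTP_H100 = 303
DIFFUSION_RTX5090 = 700
BLOCK_TOKENS = 256  # tokens generated per block


def speedup(fast, slow):
    """Ratio of two throughput figures."""
    return fast / slow


def seconds_for(tokens, tokens_per_second):
    """Time to emit `tokens` at a sustained rate."""
    return tokens / tokens_per_second


h100_speedup = speedup(DIFFUSION_H100, GEMMA4_MTP_H100)

if __name__ == "__main__":
    print(f"Speedup over Gemma 4 + MTP: {h100_speedup:.2f}x")  # 3.30x
    for name, rate in [
        ("DiffusionGemma, H100", DIFFUSION_H100),
        ("DiffusionGemma, RTX 5090", DIFFUSION_RTX5090),
        ("Gemma 4 + MTP, H100", GEMMA4_MTP_H100),
    ]:
        print(f"{name}: {seconds_for(BLOCK_TOKENS, rate):.3f} s per 256 tokens")
```

At these rates a 256-token block takes about 0.26 seconds on the H100 and about 0.37 seconds on the RTX 5090, against roughly 0.84 seconds for Gemma 4 with MTP.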

What stands out

01 Speed over quality trade-off. DiffusionGemma generates tokens more than three times faster than Gemma 4 with MTP (1,000 versus 303 tokens per second), but quality drops. Google positions this as a research preview; diffusion training for language models is still being refined.
02 Iterative token refinement. The model generates 256 tokens in parallel, then regenerates them multiple times to reduce noise. This mirrors how image diffusion models denoise latents, but applied to discrete tokens.
03 Broad inference support. Weights are available on Hugging Face. vLLM, Unsloth, and other inference frameworks already support the model. Google also hosts a free code-generation demo that shows real-time token refinement.
04 Consumer GPU performance. 700 tokens per second on an RTX 5090 puts high-speed inference within reach of prosumer hardware. The 4-billion active parameter count (out of 26 billion total) keeps per-token compute low, although the full 26 billion weights still have to be held in memory (roughly 26 GB at one byte per weight in FP8).
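A back-of-the-envelope estimate shows what the parameter counts mean for memory. It assumes every weight is stored in FP8 at one byte each and ignores the KV cache, activations, and runtime overhead, so it is a lower bound rather than a measured footprint. The 32 GB figure for the RTX 5090 comes from the card's published specification, not from the article.

```python
BYTES_PER_PARAM_FP8 = 1    # assumption: all weights stored in FP8
TOTAL_PARAMS = 26e9        # from the article
ACTIVE_PARAMS = 4e9        # from the article
RTX_5090_VRAM_GB = 32      # not from the article: published card spec


def weight_gb(params, bytes_per_param=BYTES_PER_PARAM_FP8):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


total_gb = weight_gb(TOTAL_PARAMS)    # all weights that must be resident
active_gb = weight_gb(ACTIVE_PARAMS)  # weights used for any one token

if __name__ == "__main__":
    print(f"All weights:    {total_gb:.0f} GB")
    print(f"Active weights: {active_gb:.0f} GB per token")
    print(f"Headroom on a {RTX_5090_VRAM_GB} GB card: "
          f"{RTX_5090_VRAM_GB - total_gb:.0f} GB before cache and overhead")
```

Under these assumptions, the active parameter count mainly reduces the compute and memory traffic per token, while the total parameter count determines how much memory the weights occupy.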
