DiffusionGemma-26B matches autoregressive Gemma on medical VQA, decodes 3.5× to 4.4× faster
A diffusion language model fine-tuned for radiology reports performs on par with its autoregressive sibling while offering bidirectional infill and faster inference.

A preprint published this week demonstrates that diffusion language models can match autoregressive performance on medical visual question answering while unlocking editing workflows that left-to-right generation cannot support. Researchers fine-tuned DiffusionGemma-26B, a mixture-of-experts diffusion model, and benchmarked it against Gemma-4-26B using an identical LoRA recipe on medical VQA datasets. Both models were scored by an LLM judge designed to handle verbosity differences.
Diffusion matched or exceeded the autoregressive baseline on every dataset tested. The fine-tuned model activates 3.8 billion of its 26 billion parameters per forward pass and decodes 3.5 to 4.4 times faster than its autoregressive counterpart. Despite that small active parameter count, the diffusion model is competitive with frontier vision-language models.
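
To put the reported figures in perspective, the short calculation below works out the share of parameters that are active per forward pass and what the quoted speedup range would mean for a single decode. It is a back-of-the-envelope sketch: the 10-second autoregressive latency is a hypothetical placeholder, not a number from the preprint, and the function names are illustrative.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 26.0        # total parameters (billions), from the model name
ACTIVE_PARAMS_B = 3.8        # parameters activated per forward pass (billions)
SPEEDUP_RANGE = (3.5, 4.4)   # reported decode speedup over the autoregressive model


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used in one forward pass."""
    return active_b / total_b


def diffusion_latency(ar_seconds: float, speedup: float) -> float:
    """Decode time implied by a given speedup over an autoregressive latency."""
    return ar_seconds / speedup


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active fraction: {frac:.1%}")

# ASSUMPTION: 10 s is a made-up autoregressive decode time, used only for scale.
hypothetical_ar_seconds = 10.0
for speedup in SPEEDUP_RANGE:
    seconds = diffusion_latency(hypothetical_ar_seconds, speedup)
    print(f"{speedup}x speedup: {seconds:.2f} s instead of {hypothetical_ar_seconds:.0f} s")
```

Roughly one parameter in seven is active on any forward pass, which is why the article describes inference cost as low relative to the model's total size.
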
What stands out
- Bidirectional infill. Because diffusion denoises a token canvas from both directions, a radiologist can anchor fragments of a report and have the model fill the gaps between them. Autoregressive models generate left to right and handle infill poorly.
- Speed advantage. The diffusion model decodes 3.5 to 4.4 times faster than the autoregressive Gemma-4-26B, a meaningful gain for interactive drafting workflows.
- Medical VQA parity. Diffusion matched or beat autoregressive performance on every medical visual question answering dataset tested, using the same LoRA fine-tuning recipe and model size.
- Sparse activation. The 26-billion-parameter mixture-of-experts architecture activates only 3.8 billion parameters per forward pass, keeping inference costs low while maintaining competitive accuracy.
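
The bidirectional infill idea can be illustrated with a toy canvas: anchored fragments are fixed in place, every other slot starts masked, and each round commits the proposals the denoiser is most confident about. This is a minimal sketch under stated assumptions, not the preprint's method or any DiffusionGemma API. `toy_denoiser` is a hypothetical stand-in that emits placeholder tokens and a simple neighbor-based confidence, where a real model would predict tokens conditioned on the image and on context from both sides.

```python
MASK = "<mask>"


def build_canvas(anchors: dict[int, list[str]], length: int) -> list[str]:
    """Place anchored fragments at fixed positions; all other slots are masked."""
    canvas = [MASK] * length
    for start, tokens in anchors.items():
        for offset, token in enumerate(tokens):
            canvas[start + offset] = token
    return canvas


def toy_denoiser(canvas: list[str]) -> dict[int, tuple[str, float]]:
    """Hypothetical model stand-in: propose (token, confidence) per masked slot.

    Confidence is the fraction of immediate neighbors (left and right) that
    are already filled, so slots next to anchors are resolved first.
    """
    proposals = {}
    for i, token in enumerate(canvas):
        if token != MASK:
            continue
        neighbors = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(n != MASK for n in neighbors)
        proposals[i] = (f"tok{i}", known / 2)
    return proposals


def infill(canvas: list[str], per_round: int = 2) -> list[str]:
    """Iteratively unmask the canvas, committing the most confident slots each round."""
    canvas = list(canvas)  # leave the caller's canvas untouched
    while MASK in canvas:
        proposals = toy_denoiser(canvas)
        ranked = sorted(proposals, key=lambda i: -proposals[i][1])
        for i in ranked[:per_round]:
            canvas[i] = proposals[i][0]
    return canvas


# Two anchored fragments with gaps between and after them.
anchors = {0: ["Findings", ":"], 6: ["no", "pleural", "effusion"]}
canvas = build_canvas(anchors, length=10)
report = infill(canvas)
print(canvas)
print(report)
```

Anchored tokens are never overwritten because the denoiser only proposes values for masked slots. A left-to-right generator has no equivalent of the right-hand anchor: it cannot condition on text that comes after the position it is currently writing.
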



