AlphaGRPO teaches multimodal models to reason and self-correct image generation
New reinforcement learning method enables unified multimodal models to reason about user intent and autonomously fix misalignments in generated images without separate training stages.

AlphaGRPO, a reinforcement learning framework from researchers at the University of Hong Kong and Zhejiang University, trains unified multimodal models to reason about implicit user intent and self-correct generation errors. The method applies Group Relative Policy Optimization to autoregressive-diffusion models, enabling two capabilities without a cold-start phase: reasoning text-to-image generation, where the model infers what a user actually wants from ambiguous prompts, and self-reflective refinement, where it diagnoses and fixes its own output misalignments.
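To make the group-relative mechanics concrete, here is a minimal sketch of the advantage computation at the heart of GRPO, which scores each sample against its own sampling group instead of a learned critic. The function name, array shapes, and reward values are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO's core idea: score each sample relative to its own group.

    `rewards` has shape (num_prompts, group_size); each row holds the
    rewards for G images generated from the same prompt. A sample's
    advantage is its reward standardized against the group mean and
    std, so no separate value network (critic) is needed.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 generated images each, verifier rewards in [0, 1].
rewards = np.array([[0.9, 0.4, 0.7, 0.2],
                    [0.5, 0.5, 0.8, 0.1]])
advantages = group_relative_advantages(rewards)
# Images scoring above their group's mean get positive advantages and are
# reinforced; below-mean images are penalized.
```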
The framework introduces Decompositional Verifiable Reward (DVReward), a supervision signal that replaces scalar reward scores with atomic, verifiable questions. An LLM breaks each complex user request into semantic and quality checks ("Does the image contain a red car?" rather than "Rate this image 1-10"), and a multimodal LLM then answers each check against the generated image. This decomposition yields more stable and interpretable feedback than holistic scoring.
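A sketch of how such a decompositional reward could be computed. Here `decompose` and `verify` are hypothetical stand-ins for the LLM and multimodal-LLM calls, and the fraction-of-checks-passed aggregation is an assumption; the article does not specify how atomic results are combined.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    question: str  # atomic, verifiable yes/no question
    passed: bool

def dv_reward(
    user_prompt: str,
    image: bytes,
    decompose: Callable[[str], list[str]],  # LLM: prompt -> atomic questions
    verify: Callable[[bytes, str], bool],   # MLLM: (image, question) -> yes/no
) -> tuple[float, list[Check]]:
    """Decompositional verifiable reward: the score is the fraction of
    atomic checks the image passes, rather than a holistic 1-10 rating."""
    questions = decompose(user_prompt)  # e.g. "Does the image contain a red car?"
    checks = [Check(q, verify(image, q)) for q in questions]
    if not checks:
        return 0.0, []
    reward = sum(c.passed for c in checks) / len(checks)
    return reward, checks  # per-check results keep the signal interpretable
```

Returning the individual checks alongside the scalar reward is what makes the feedback auditable: a failed question pinpoints which part of the request the image missed.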
Benchmark results show gains across GenEval, TIIF-Bench, DPG-Bench, and WISE. The model also improved on the GEdit editing benchmark despite never being trained on editing tasks, suggesting the self-reflective approach transfers to related generation problems. The preprint is available at arxiv.org/abs/2605.12495.