DACA-GRPO: per-step credit assignment lifts diffusion LLM performance by up to 36 points
A new arXiv preprint introduces Denoising-Aware Credit Assignment for GRPO, a reinforcement learning method that weights denoising steps by importance and reduces mean-field bias, lifting diffusion language model performance by up to 36 points on constraint tasks.
Diffusion large language models generate text by iteratively denoising a sequence, but existing reinforcement learning methods treat every denoising step as equally important and rely on biased likelihood estimates that introduce high variance. Researchers have now published DACA-GRPO (Denoising-Aware Credit Assignment for GRPO), a lightweight enhancement that slots into any GRPO-style trainer to address both problems without architectural changes.
The method introduces two complementary mechanisms. Denoising Progress Scores extract per-token importance weights from intermediate predictions at no extra forward-pass cost, assigning credit across the denoising trajectory. Stratified Masking Likelihood partitions token positions into strata so each token is predicted with most of the sequence as context, cutting the mean-field bias that inflates variance in policy gradients. Applied on top of three GRPO base methods and evaluated across seven benchmarks, DACA-GRPO delivered gains of 5.6 percentage points on mathematical reasoning, 7.4 points on code generation, 36.3 points on constraint satisfaction, and 5.9 points on JSON schema adherence (the constrained generation category).
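Because the preprint's code is not available, the following is only a minimal Python sketch of the two ideas as described above, not the authors' implementation. Every name in it is hypothetical: `log_prob_fn` stands in for a diffusion LLM forward pass, the strata are assumed to be drawn at random, and the per-token weights are assumed to be normalized so that uniform weights recover plain GRPO group-relative advantages.

```python
import math
import random


def stratified_partition(seq_len, num_strata, rng):
    """Randomly split token positions into disjoint strata (assumed scheme)."""
    positions = list(range(seq_len))
    rng.shuffle(positions)
    # Striding over the shuffled list gives disjoint groups covering every position.
    return [sorted(positions[k::num_strata]) for k in range(num_strata)]


def stratified_log_likelihood(tokens, log_prob_fn, num_strata, rng):
    """Estimate per-token log-probs by masking one stratum at a time.

    log_prob_fn(tokens, masked) is a hypothetical stand-in for the model: it
    returns one log-prob per position, conditioned on the unmasked tokens.
    """
    per_token = [0.0] * len(tokens)
    for stratum in stratified_partition(len(tokens), num_strata, rng):
        masked = set(stratum)
        logps = log_prob_fn(tokens, masked)  # one forward pass per stratum
        for i in stratum:
            # Each token is scored while every other stratum is visible.
            per_token[i] = logps[i]
    return per_token


def weighted_group_advantages(rewards, token_weights):
    """Spread GRPO-style group-relative advantages over tokens by weight."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]
    weighted = []
    for adv, weights in zip(advantages, token_weights):
        total = sum(weights)
        n = len(weights)
        # Rescale so weights average to 1; uniform weights leave adv unchanged.
        weighted.append([adv * n * w / total for w in weights])
    return weighted


def toy_log_prob(tokens, masked):
    """Toy model (assumption): confidence drops as more positions are hidden."""
    return [-math.log(1 + len(masked))] * len(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(8))
    print(stratified_log_likelihood(tokens, toy_log_prob, 4, rng))
    print(weighted_group_advantages([1.0, 0.0], [[3, 1, 1, 1], [1, 1, 1, 1]]))
```

With four strata over eight tokens, each forward pass hides only two positions, so every token is scored with three quarters of the sequence visible. Scoring all eight masked at once would be the fully mean-field case. How the paper actually derives its importance weights from intermediate predictions is not specified here, so the weights are passed in as plain inputs.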
The preprint does not name the base diffusion models or release training code. The authors frame DACA-GRPO as a plug-and-play component, suggesting it could generalize to other diffusion LLM architectures. Open-weight implementations and reproducible training recipes remain the next milestone. Diffusion models are less mature than their autoregressive counterparts in the open-source ecosystem, and practitioners will want to see whether the credit-assignment gains hold at larger scale and across more diverse prompt distributions.
