DPA-GRPO trains paired LLMs to critique and revise structured outputs
A new arXiv preprint introduces DPA-GRPO, a two-player training method that teaches a generator LLM to propose and revise structured outputs and a verifier LLM to raise formal safety assurance cases when it detects errors, improving decision accuracy on tax-form tasks.
DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization) is a reinforcement-learning method that trains LLMs to produce reliable structured outputs through adversarial critique. The system runs a generator–verifier game: the generator proposes an answer to a structured decision task—form filling, compliance checking, maintenance reporting—and may revise it when challenged; the verifier either stays silent or raises a formal safety assurance case (SAC) containing a claim, argument, and evidence. Those paired KEEP/REVISE and SAC/no-SAC decisions create counterfactual action groups, which DPA-GRPO uses for role-specific policy updates under KL regularization.
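The preprint's update rule isn't spelled out here beyond its GRPO lineage, so the following is a minimal sketch, assuming a standard group-relative advantage and the k3 KL estimator commonly paired with GRPO; every name in it (paired_action_loss, beta, the tensor layout) is illustrative, not the authors' code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style baseline: normalize each reward against its own
    # sample group, so no learned value function is needed.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def paired_action_loss(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       logp_ref: torch.Tensor,
                       rewards: torch.Tensor,
                       beta: float = 0.04) -> torch.Tensor:
    """One role's update over a group of counterfactual paired actions.

    logp_new / logp_old: log-probs of the sampled actions (KEEP vs
    REVISE for the generator, SAC vs no-SAC for the verifier) under
    the current and behavior policies.
    logp_ref: log-probs under a frozen reference policy (KL anchor).
    """
    adv = group_relative_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)      # importance ratio
    pg_loss = -(ratio * adv).mean()             # policy-gradient term
    # k3 KL estimator: unbiased and nonnegative per sample.
    kl = (torch.exp(logp_ref - logp_new)
          - (logp_ref - logp_new) - 1.0).mean()
    return pg_loss + beta * kl
```

In this reading, a group would contain both branches of a paired decision (the generator's KEEP and REVISE rollouts on the same challenged output, or the verifier's SAC and no-SAC branches), so the normalized advantage directly contrasts counterfactuals, and each role is updated with its own instance of this loss.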
Existing refinement methods lean on heuristic debate, self-play, or LLM-generated supervision, which the authors argue creates a second-order assurance problem: who checks the checker? DPA-GRPO sidesteps this by requiring the verifier to produce a structured case with explicit evidence, making each intervention auditable and grounded in stated reasoning rather than opaque LLM judgment.
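The preprint names the SAC's three components but, at least as summarized here, no concrete schema. A minimal sketch of what such a record could look like, with target_field added purely for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SafetyAssuranceCase:
    """Structured challenge the verifier emits instead of free-form critique.

    claim, argument, and evidence are the three components the preprint
    names; target_field is a hypothetical extra for pointing the case
    at a specific slot in the structured output.
    """
    claim: str     # what is wrong, stated as a checkable proposition
    argument: str  # why the claim follows from the evidence
    evidence: list[str] = field(default_factory=list)  # e.g. rule citations, recomputed values
    target_field: Optional[str] = None  # hypothetical: which output slot is challenged
```

The point of the structure is that an auditor can evaluate the case without trusting either model: the claim either does or does not follow from the cited evidence.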
On benchmarks
The authors tested DPA-GRPO on TaxCalcBench TY24, a structured tax-form dataset, using Qwen3-4B and Qwen3-8B models. The method improved structured decision accuracy over zero-shot generation and generator-only RL baselines across both model sizes. Training increased correct silent acceptance—the verifier learned to stay quiet when the generator was right—reduced missed errors, and improved calibrated revision behavior, meaning the generator revised more often when actually wrong and less often when correct. The gains held for both roles, indicating the paired-action updates worked for generator and verifier simultaneously.
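Read literally, calibrated revision is a pair of conditional rates, and the verifier-side metrics (correct silent acceptance, missed errors) decompose the same way. A sketch of scoring the generator side, with all names assumed:

```python
def revision_calibration(records):
    """records: iterable of (was_correct, revised) booleans,
    one pair per challenged output.

    Calibrated revision in the paper's sense means the first rate
    rises with training while the second falls.
    """
    wrong = [revised for ok, revised in records if not ok]
    right = [revised for ok, revised in records if ok]
    revise_when_wrong = sum(wrong) / max(len(wrong), 1)  # want high
    revise_when_right = sum(right) / max(len(right), 1)  # want low
    return revise_when_wrong, revise_when_right
```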
The method targets workflows where outputs must be locally correct, globally consistent, and auditable against task-specific rules—settings where a single hallucination or inconsistency can break downstream compliance or safety checks. The preprint is available as arXiv:2605.08327v1.
