Progress advantage extracts step-level rewards from RL post-training without annotation
A new technique derives optimal advantage functions from standard RL policy training, eliminating the need for dedicated reward models in long-horizon agent tasks.

Progress advantage is a step-level scoring method that extracts implicit reward signals directly from reinforcement learning post-training, with no human annotation. The score is the log-probability ratio between an RL-trained policy and its reference policy; the researchers show that, under a stochastic Markov decision process, this ratio exactly recovers the optimal advantage function. Because it emerges as a byproduct of the standard RL pipeline, it works across domains and task horizons without task-specific training.
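To see why a log-probability ratio can act as an advantage function, the standard KL-regularized RL result is sketched below. This is an illustration under assumptions, not the researchers' own derivation: the regularized objective and the coefficient \beta are not stated in this summary, and the symbols r, P, Q^*, V^* and A^* are the usual textbook ones.

```latex
% Assumed objective: expected return with a per-step KL penalty toward the reference policy
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_t r(s_t, a_t)
         - \beta\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\big)\Big]

% Soft Bellman equations; the expectation over s' covers stochastic transitions
Q^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^*(s')\big],
\qquad
V^*(s) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big(Q^*(s, a)/\beta\big)

% Optimal policy of the regularized problem
\pi^*(a \mid s) = \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\Big(\frac{Q^*(s, a) - V^*(s)}{\beta}\Big)

% Taking logs and rearranging gives the advantage
\beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} = Q^*(s, a) - V^*(s) = A^*(s, a)
```

In practice the trained policy only approximates \pi^*, so under these assumptions the ratio is an estimate of A^* whose quality depends on how well post-training converged.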
Process reward models have become central to evaluating LLM reasoning and agent behavior at the step level, but building them for agentic settings has remained prohibitively expensive. Long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. Progress advantage sidesteps that bottleneck: the log-probability ratio between the RL-trained policy and its reference policy is already computed during standard post-training and can be repurposed as an optimal advantage function without any additional training. The signal is annotation-free, domain-agnostic, and available the moment RL post-training finishes.
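As a rough illustration of how such a score could be computed, the sketch below sums per-token log-probability differences within each agent step. The function name `step_progress_advantage`, the `step_boundaries` convention, and the `beta` scale are hypothetical choices made for this example; the summary does not specify how tokens are grouped into steps or whether a scale factor is applied.

```python
from typing import List, Sequence


def step_progress_advantage(
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    step_boundaries: Sequence[int],
    beta: float = 1.0,
) -> List[float]:
    """Score each step by beta * log(pi_rl(step) / pi_ref(step)).

    policy_logprobs / reference_logprobs: per-token log-probabilities of the
        same generated tokens under the RL-trained and reference policies.
    step_boundaries: exclusive end index of each step, in ascending order
        (hypothetical convention for this sketch).
    beta: assumed KL coefficient; 1.0 leaves the raw log-ratio unscaled.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both policies must score the same token sequence")

    scores: List[float] = []
    start = 0
    for end in step_boundaries:
        if end < start or end > len(policy_logprobs):
            raise ValueError("step_boundaries must be ascending and in range")
        # A multi-token step's log-probability is the sum of its token
        # log-probabilities, so the step log-ratio is the sum of differences.
        log_ratio = sum(
            p - r
            for p, r in zip(policy_logprobs[start:end], reference_logprobs[start:end])
        )
        scores.append(beta * log_ratio)
        start = end
    return scores
```

A positive score means the RL-trained policy assigns the step more probability than the reference policy does; a negative score means it assigns less. The two log-probability sequences would typically come from one forward pass of each model over the same trajectory.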
Validation spanned three applications (test-time scaling, uncertainty quantification, and failure attribution) across five benchmarks and four model families. In every setting, progress advantage outperformed confidence-based baselines and, despite requiring no task-specific training, surpassed dedicated reward models fine-tuned on labeled data for each benchmark.
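The summary does not say how step scores feed into these applications, so the following is only one plausible sketch. It assumes per-step scores are already available, ranks candidate trajectories by their summed score for test-time scaling, and flags the lowest-scoring step for failure attribution. Both function names and both aggregation rules (sum and minimum) are hypothetical.

```python
from typing import Sequence


def pick_best_candidate(candidate_step_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest total step score.

    Summing step scores is an assumed aggregation rule for this sketch.
    """
    if not candidate_step_scores:
        raise ValueError("need at least one candidate")
    totals = [sum(scores) for scores in candidate_step_scores]
    return max(range(len(totals)), key=totals.__getitem__)


def most_suspect_step(step_scores: Sequence[float]) -> int:
    """Return the index of the lowest-scoring step in one trajectory.

    Treating the minimum score as the likely failure point is an assumption.
    """
    if not step_scores:
        raise ValueError("need at least one step")
    return min(range(len(step_scores)), key=step_scores.__getitem__)
```

Other aggregations, such as the mean or the minimum step score per trajectory, would fit the same interface; which one works best is an empirical question that this summary does not address.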



