Progress advantage extracts step-level rewards from RL post-training without annotation

A new technique derives optimal advantage functions from standard RL policy training, eliminating the need for dedicated reward models in long-horizon agent tasks.

By Alex Sokoloff · June 27, 2026

Progress advantage is a step-level scoring method that extracts implicit reward signals directly from reinforcement learning post-training, eliminating the need for human annotation. The score is the log-probability ratio between an RL-trained policy and its reference policy; the researchers show that, under a stochastic Markov decision process, this ratio exactly recovers the optimal advantage function. Because it emerges as a byproduct of the standard RL pipeline, it works across domains and task horizons without task-specific training.
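For readers who want the shape of the argument, here is a sketch of one standard setting in which a log-probability ratio equals an optimal advantage: KL-regularized RL with coefficient β. The article does not say which objective the researchers use, so the objective, the discount γ, and the symbols below are assumptions made for illustration, not the paper's own derivation.

```latex
% Assumed objective: maximize expected return minus a KL penalty toward the reference policy.
% Soft-optimal value functions under this assumption:
\begin{aligned}
V^*(s)   &= \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big(Q^*(s,a)/\beta\big), \\
Q^*(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V^*(s')\big].
\end{aligned}

% The optimal policy reweights the reference policy by the exponentiated advantage:
\pi^*(a \mid s) = \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\Big(\big(Q^*(s,a) - V^*(s)\big)/\beta\Big)

% Taking logs and rearranging gives the advantage as a log-probability ratio:
A^*(s,a) = Q^*(s,a) - V^*(s) = \beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
```

The expectation over next states s' is where environment stochasticity enters. In practice the RL-trained policy stands in for π*, so the identity holds only to the extent that training has converged.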

Process reward models have become central to evaluating LLM reasoning and agent behavior at the step level, but building them for agentic settings has remained prohibitively expensive. Long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. Progress advantage sidesteps that bottleneck: the log-probability ratio between the RL-trained policy and its reference policy is already computed during standard post-training and can be repurposed as an optimal advantage function without any additional training. The signal is annotation-free, domain-agnostic, and available the moment RL post-training finishes.
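A minimal sketch of how such a step score might be computed from quantities that are already on hand. It assumes you have per-token log-probabilities for each step of a trajectory under both the RL-trained policy and the reference policy. The function name `progress_advantage`, the `beta` scaling factor, and the choice to sum token-level ratios within a step are illustrative assumptions, not details confirmed by the article.

```python
from typing import List, Sequence


def progress_advantage(
    policy_logprobs: Sequence[Sequence[float]],
    reference_logprobs: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Score each step as beta * sum over tokens of (log pi_RL - log pi_ref).

    Each inner sequence holds the log-probabilities of the tokens in one step.
    Summing over tokens and scaling by beta are assumptions of this sketch.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both policies must score the same number of steps")

    scores: List[float] = []
    for step_rl, step_ref in zip(policy_logprobs, reference_logprobs):
        if len(step_rl) != len(step_ref):
            raise ValueError("a step must contain the same tokens under both policies")
        # Log-probability ratio of the whole step = sum of token-level log ratios.
        scores.append(beta * sum(rl - ref for rl, ref in zip(step_rl, step_ref)))
    return scores


# Toy two-step trajectory with made-up log-probabilities.
rl_logprobs = [[-0.1, -0.2], [-1.5, -0.9]]
ref_logprobs = [[-0.4, -0.5], [-0.6, -0.3]]
step_scores = progress_advantage(rl_logprobs, ref_logprobs)
```

In this toy input the first step scores positive (the RL-trained policy favors it more than the reference does) and the second scores negative, which is the kind of per-step signal a process reward model would otherwise have to be trained to produce.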

Validation spanned three applications (test-time scaling, uncertainty quantification, and failure attribution) across five benchmarks and four model families. In every setting, progress advantage outperformed confidence-based baselines and, despite requiring no task-specific training, surpassed dedicated reward models fine-tuned on labeled data for each benchmark.
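To make two of these applications concrete, here is a hedged sketch of how per-step scores could drive test-time scaling (choosing among sampled trajectories) and failure attribution (locating the weakest step). The helper names, the use of the mean step score for ranking, and the "lowest-scoring step" rule are hypothetical choices for illustration; the article does not describe the exact procedures used in the evaluation.

```python
from typing import Sequence


def select_best_trajectory(candidate_step_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest mean step score."""
    if not candidate_step_scores:
        raise ValueError("need at least one candidate trajectory")
    if any(len(scores) == 0 for scores in candidate_step_scores):
        raise ValueError("every candidate needs at least one scored step")
    means = [sum(scores) / len(scores) for scores in candidate_step_scores]
    return max(range(len(means)), key=means.__getitem__)


def attribute_failure(step_scores: Sequence[float]) -> int:
    """Return the index of the lowest-scoring step as the suspected failure point."""
    if not step_scores:
        raise ValueError("need at least one scored step")
    return min(range(len(step_scores)), key=step_scores.__getitem__)


# Two sampled trajectories with made-up per-step scores.
candidates = [
    [0.6, -1.5, 0.2],
    [0.3, 0.4, 0.1],
]
best_index = select_best_trajectory(candidates)
suspect_step = attribute_failure(candidates[0])
```

Here the second trajectory is selected because its average step score is higher, and the second step of the first trajectory is flagged as the most likely point of failure.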
