Missing old logits break asynchronous PPO for LLM agents, new paper proposes exact and approximate fixes
A preprint on arXiv identifies a semantic mismatch in asynchronous reinforcement learning pipelines that train large language model agents, and offers exact and approximate fixes that keep the correction for policy staleness separate from the correction for training–inference discrepancy.

Asynchronous reinforcement learning speeds up training for large language model agents by generating rollout samples on one set of workers while a separate optimizer updates the policy. That decoupling boosts throughput, but a new preprint from Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, and Xiong Jun Wu shows it also introduces a silent failure mode in PPO-style off-policy correction. Posted to arXiv on May 13, the paper argues that practical asynchronous pipelines lose the historical training-side logits (what the authors call "old logits") needed to decompose the importance ratio cleanly into a training–inference discrepancy term and a policy-staleness term: by the time a sample reaches the optimizer, the trainer has already stepped past the policy version that generated it, and that version's logits are gone unless they were explicitly saved. Without the old logits, the two correction factors become entangled, clipping and masking thresholds end up policing both effects at once, and the intended semantics of decoupled correction collapse.
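To see what goes missing, here is a minimal sketch of a decoupled PPO-style surrogate, assuming per-token log-probabilities of the sampled tokens are available from three sources: the current policy, the training-side snapshot that anchored the update ("old"), and the inference engine that generated the sample ("behavior"). The function name, loss form, and clipping threshold are illustrative, not the paper's exact objective.

```python
import torch

def decoupled_ppo_loss(logp_new, logp_old, logp_behav, advantages, clip_eps=0.2):
    """Decoupled PPO-style surrogate over per-token log-probs of the
    sampled actions. Clipping constrains only the training-side ratio
    pi_new / pi_old; the staleness / train-inference gap enters as a
    separate, unclipped importance weight pi_old / pi_behav."""
    # Trust-region factor: movement of the current policy away from
    # the training-side snapshot that anchored this update.
    ratio_train = torch.exp(logp_new - logp_old)
    # Importance weight: gap between that snapshot and the inference
    # engine that actually sampled the tokens (staleness plus any
    # numerical or kernel-level discrepancy).
    iw_behav = torch.exp(logp_old - logp_behav).detach()
    unclipped = ratio_train * advantages
    clipped = torch.clamp(ratio_train, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(iw_behav * torch.minimum(unclipped, clipped)).mean()
```

Without logp_old, the only computable ratio is exp(logp_new - logp_behav), which fuses the trust-region factor and the importance weight into a single number and leaves one clip threshold policing both effects at once, exactly the entanglement the paper describes.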
The authors propose three exact strategies to recover the missing old logits: snapshot-based version tracking that checkpoints policy states at rollout time, a dedicated old-logit model that runs in parallel to generate the required reference distributions, and synchronization via partial rollout interruption that pauses inference workers to align policy versions. Each route carries different system trade-offs—snapshot tracking adds storage overhead, the dedicated model doubles inference compute, and interruption sacrifices throughput. For cases where exact recovery is too expensive, the paper examines approximate correction that preserves the benefits of decoupled semantics by choosing a more appropriate surrogate policy, avoiding the need for extra infrastructure.
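As an illustration of the first route, here is a rough sketch of snapshot-based version tracking, assuming a causal language model that maps token ids to per-position logits; Rollout, SnapshotStore, and the eviction policy are hypothetical constructions, not the paper's implementation.

```python
import torch
from dataclasses import dataclass

@dataclass
class Rollout:
    tokens: torch.Tensor      # sampled token ids, shape (T,)
    logp_behav: torch.Tensor  # per-token log-probs from the inference engine
    policy_version: int       # trainer step whose weights produced this sample

class SnapshotStore:
    """Keep a small ring of parameter snapshots so the exact old
    logits can be recomputed for any sample still in flight."""

    def __init__(self, max_versions: int = 4):
        self.max_versions = max_versions
        self.snapshots = {}  # version -> CPU copy of the state dict

    def save(self, version: int, model: torch.nn.Module):
        self.snapshots[version] = {
            k: v.detach().cpu().clone() for k, v in model.state_dict().items()
        }
        # Evict the oldest snapshot once the ring is full; a real system
        # would wait until no in-flight rollout still references it.
        while len(self.snapshots) > self.max_versions:
            del self.snapshots[min(self.snapshots)]

    @torch.no_grad()
    def old_logprobs(self, rollout: Rollout, model: torch.nn.Module):
        # Swap in the matching snapshot, recompute the training-side
        # log-probs of the sampled tokens, then restore the live weights.
        # (The one-position shift for next-token prediction is elided.)
        live = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        model.load_state_dict(self.snapshots[rollout.policy_version])
        logits = model(rollout.tokens.unsqueeze(0)).squeeze(0)  # (T, V)
        logp = torch.log_softmax(logits, dim=-1)
        old = logp.gather(-1, rollout.tokens.unsqueeze(-1)).squeeze(-1)
        model.load_state_dict(live)
        return old
```

The storage overhead the paper flags for this route scales with how many versions must stay live, which in turn depends on how stale the slowest in-flight rollout is allowed to become.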
Following that analysis, the team adopts a revised PPO-EWMA method that blends an exponentially weighted moving average of the policy into the correction step as the surrogate for the missing old policy. The approach delivers measurable gains in both training speed and optimization performance across the experiments reported in the preprint. Code implementing the methods is available at github.com/millioniron/ROLL.
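To make the EWMA idea concrete, here is a minimal sketch in the spirit of the original PPO-EWMA recipe, which averages the policy weights themselves; the class name, decay value, and model interface (token ids in, per-position logits out) are illustrative assumptions, and the paper's revised variant may differ in detail.

```python
import copy
import torch

class EwmaPolicy:
    """Maintain an exponentially weighted moving average of the policy
    weights and use it as the surrogate old policy, so no historical
    logits need to be stored or recomputed."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.99):
        self.decay = decay
        self.avg = copy.deepcopy(model).eval()
        for p in self.avg.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # avg <- decay * avg + (1 - decay) * current, parameter-wise,
        # applied once per optimizer step.
        for p_avg, p in zip(self.avg.parameters(), model.parameters()):
            p_avg.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    @torch.no_grad()
    def logprobs(self, tokens: torch.Tensor):
        # Per-token log-probs of the sampled tokens under the EWMA
        # policy, standing in for the missing old logits.
        logits = self.avg(tokens.unsqueeze(0)).squeeze(0)  # (T, V)
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
```

The EWMA log-probs slot into the logp_old argument of the decoupled surrogate sketched earlier, with the decay rate trading closeness to the current policy against stability of the surrogate. The next step is to see whether the snapshot or dedicated-model route proves cheaper at scale in production agentic systems, and whether the approximate correction holds up when context windows and rollout lengths grow beyond the paper's test range.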