SAGE method lifts pass@k in reasoning model training by reshaping KL anchors
A new arXiv preprint introduces SAGE, a method that addresses exploration limits in reinforcement learning with verifiable rewards by modifying the reverse-KL anchor rather than removing regularization entirely.
Reinforcement learning with verifiable rewards (RLVR) reliably raises a model's first-attempt accuracy on reasoning tasks but often fails to widen the range of correct solutions it can produce, a gap that suggests the training mostly makes sampling more efficient rather than teaching new reasoning modes.
SAGE, a framework detailed in a preprint posted to arXiv on May 20, 2026, targets that structural bottleneck. The authors argue that reverse-KL regularization—the term that keeps training stable by anchoring the policy to a reference distribution—inherently suppresses the emergence of alternative reasoning paths. This constraint matters because RLVR has become a standard post-training technique for open-weight reasoning models, yet practitioners routinely observe that pass@1 scores climb while pass@k metrics plateau. The implication is that the model learns to sample its existing reasoning modes more efficiently without discovering fundamentally new solution strategies.
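The suppression argument can be illustrated with a standard result that does not depend on SAGE itself: over a finite set of candidate solutions, the policy that maximizes expected reward minus beta times the reverse KL to an anchor is proportional to anchor * exp(reward / beta). The sketch below is a toy illustration of that formula, not code from the preprint; the three strategies, their probabilities, and the value of beta are made-up assumptions.

```python
import numpy as np

def kl_regularized_optimum(anchor, reward, beta):
    """Maximizer of E_pi[reward] - beta * KL(pi || anchor) on a finite set.

    Standard closed form: pi*(y) is proportional to anchor(y) * exp(reward(y) / beta).
    """
    anchor = np.asarray(anchor, dtype=float)
    reward = np.asarray(reward, dtype=float)
    weights = anchor * np.exp(reward / beta)
    return weights / weights.sum()

# Hypothetical strategies: [wrong, correct and known, correct but never sampled].
reference = np.array([0.6, 0.4, 0.0])   # assumed anchor probabilities
reward = np.array([0.0, 1.0, 1.0])      # verifier reward per strategy
beta = 0.1                              # assumed KL coefficient

policy = kl_regularized_optimum(reference, reward, beta)
print(policy)  # nearly all mass on the known correct strategy
```

A strategy with zero anchor probability stays at zero no matter how high its reward, which is the sense in which the anchor caps what the policy can discover. Only changing the anchor passed to the function, not the reward, can bring the third strategy into play.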
Neither obvious alternative holds up: removing the KL term or swapping it for forward-KL destabilizes the efficiency-coverage trade-off. Dropping the term invites reward hacking, where the model exploits verifier weaknesses to claim credit for invalid reasoning. Forward-KL spreads probability mass into off-target regions, diluting the model's ability to produce any correct answer at all. Both paths sacrifice the reliability that makes RLVR viable in production.
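The difference between the two KL directions is a general property of the divergence and can be checked numerically. The following is a minimal sketch with invented distributions over three regions (two valid reasoning modes and one off-target region); it is not an experiment from the preprint.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; terms where p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Assumed toy distributions over [mode A, mode B, off-target region].
reference = [0.4995, 0.4995, 0.001]
collapsed = [0.998, 0.001, 0.001]   # policy concentrated on one mode
spread = [0.35, 0.35, 0.30]         # policy leaking mass off-target

# Reverse KL: KL(policy || reference). Forward KL: KL(reference || policy).
reverse_collapsed = kl(collapsed, reference)
reverse_spread = kl(spread, reference)
forward_collapsed = kl(reference, collapsed)
forward_spread = kl(reference, spread)

print("reverse:", reverse_collapsed, reverse_spread)
print("forward:", forward_collapsed, forward_spread)
```

With these numbers, reverse KL assigns the lower penalty to the collapsed policy, while forward KL assigns the lower penalty to the policy that leaks mass into the off-target region, matching the trade-off described above.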
The proposed solution reshapes the anchor distribution itself through a guide function q(x,y), allowing controlled expansion of the model's empirical support without discarding the stabilizing properties of reverse-KL. In effect, SAGE shifts the reference point the policy is anchored to, rather than cutting the anchor line entirely. Evaluated on mathematical reasoning benchmarks, the method delivers consistent gains in both pass@1—the likelihood of a correct first answer—and pass@k, the likelihood of finding a correct answer in k attempts. That dual improvement suggests the training is genuinely broadening the model's reasoning repertoire rather than merely reweighting existing modes.
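For readers who want to compute the two metrics themselves, a widely used unbiased estimator of pass@k takes n sampled answers per problem, c of which are correct, and returns 1 - C(n - c, k) / C(n, k). The sketch below assumes that estimator; the article does not state how the preprint computes its scores, and the sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given n samples drawn for a problem of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples drawn, 3 verified correct.
print(pass_at_k(10, 3, 1))  # about 0.30, the pass@1 estimate
print(pass_at_k(10, 3, 5))  # about 0.92, the pass@5 estimate
```

Averaging this value over all benchmark problems gives the reported score. A method that only sharpens existing modes tends to raise the k = 1 figure while leaving large-k figures flat, which is why gains at both ends are the notable result here.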
The code is available at https://github.com/tally0818/SAGE.
