RELEX predicts RLVR checkpoints at 15% training cost via rank-1 geometry
New preprint shows reinforcement learning weight updates follow a predictable rank-1 path, enabling a linear-regression method that matches full RLVR performance with up to 85% fewer training steps.

Reinforcement learning with verifiable rewards (RLVR) has become the dominant method for sharpening reasoning in large language models, but a new preprint reveals that the weight updates it produces follow an unexpectedly simple geometry. Researchers from multiple institutions demonstrate that RLVR parameter trajectories are extremely low-rank — so low-rank, in fact, that a single direction captures most of the performance gain.
The team tested their finding on three Qwen models: Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base. They discovered that the magnitude of the rank-1 projection evolves near-linearly with training steps, meaning you can observe a short window of RLVR updates, fit a line, and extrapolate future checkpoints without running the rest of the training loop. The method they propose, RELEX (REinforcement Learning EXtrapolation), does exactly that: it estimates the rank-1 subspace from a brief observation window, then uses linear regression to predict checkpoints 10–20× beyond that window.
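As a rough illustration of the observe, fit, and extrapolate loop described above, here is a minimal NumPy sketch. It is not the authors' released code: the helper name `extrapolate_checkpoint`, the choice of a plain SVD over flattened weight deltas, and the synthetic trajectory are all assumptions made for illustration.

```python
import numpy as np

def extrapolate_checkpoint(checkpoints, steps, target_step):
    """Hypothetical helper: predict flattened weights at `target_step`.

    checkpoints: (T, D) array, one flattened weight vector per observed step.
    steps:       (T,) training-step index of each row.
    """
    checkpoints = np.asarray(checkpoints, dtype=float)
    steps = np.asarray(steps, dtype=float)

    # Express every observed checkpoint as an update relative to the first one.
    anchor = checkpoints[0]
    deltas = checkpoints - anchor

    # Assumption: the rank-1 subspace is taken as the top right-singular
    # vector of the stacked updates.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]

    # Magnitude of each update along that direction, then a straight-line fit.
    magnitudes = deltas @ direction
    slope, intercept = np.polyfit(steps, magnitudes, deg=1)

    # Extrapolate the magnitude and map it back to weight space.
    return anchor + (slope * target_step + intercept) * direction

# Synthetic stand-in for an RLVR run: linear drift along one direction plus
# small noise. Real checkpoints would be flattened model weights.
rng = np.random.default_rng(0)
dim = 64
base = rng.normal(size=dim)
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

observed_steps = np.arange(0, 51, 10)  # observe steps 0, 10, ..., 50
observed = np.stack([
    base + 0.01 * s * true_dir + 1e-4 * rng.normal(size=dim)
    for s in observed_steps
])

predicted = extrapolate_checkpoint(observed, observed_steps, target_step=1000)
target = base + 0.01 * 1000 * true_dir
rel_error = np.linalg.norm(predicted - target) / np.linalg.norm(target - base)
print(f"relative error at step 1000: {rel_error:.4f}")
```

The sign of a singular vector is arbitrary, but the sketch is unaffected because the fitted magnitudes flip sign together with the direction. On a real model the SVD would likely be applied per weight matrix or to a compressed representation, which this toy version does not attempt.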
On benchmarks
RELEX matches or exceeds full RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the training steps. In one experiment, the authors observed only the first 50 steps and successfully extrapolated to step 1000 with continued improvement. Ablation studies confirm that neither raising the subspace rank nor switching to non-linear models yields further gains — the rank-1 linear path is sufficient.
The authors attribute RELEX's success to a "denoising" effect: projecting updates onto the rank-1 subspace filters out stochastic optimization noise that would otherwise degrade extrapolated checkpoints. The preprint and code are both public, and the finding suggests that RLVR training may be far more predictable — and far cheaper — than current practice assumes.
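The denoising intuition can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' analysis: the updates are synthetic, the noise is isotropic Gaussian, and all variable names are hypothetical. It compares how far the raw noisy updates and their rank-1 projections sit from the clean trajectory.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_steps = 256, 20

# Clean updates grow linearly along a single unit direction.
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
clean = np.outer(0.05 * np.arange(1, n_steps + 1), true_dir)

# Assumed noise model: independent Gaussian noise on every coordinate.
noisy = clean + 0.01 * rng.normal(size=clean.shape)

# Estimate the dominant direction from the noisy updates alone.
_, _, vt = np.linalg.svd(noisy, full_matrices=False)
est_dir = vt[0]

# Keep only the component of each update along the estimated direction.
projected = np.outer(noisy @ est_dir, est_dir)

err_raw = np.linalg.norm(noisy - clean)
err_projected = np.linalg.norm(projected - clean)
print(f"error before projection: {err_raw:.3f}")
print(f"error after projection:  {err_projected:.3f}")
```

In this setup the projection discards the noise that lies outside the estimated direction, so the projected updates land closer to the clean trajectory than the raw ones. Whether real RLVR optimization noise behaves this way is the preprint's claim, not something this toy example establishes.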