
RELEX predicts RLVR checkpoints at 15% training cost via rank-1 geometry | UncensoredHub

RELEX predicts RLVR checkpoints at 15% training cost via rank-1 geometry

New preprint shows reinforcement learning weight updates follow a predictable rank-1 path, enabling a linear-regression method that matches full RLVR performance with 85% fewer steps.

By Alex Sokoloff · May 18, 2026

Reinforcement learning with verifiable rewards (RLVR) has become the dominant method for sharpening reasoning in large language models, but a new preprint reveals that the weight updates it produces follow an unexpectedly simple geometry. Researchers from multiple institutions demonstrate that RLVR parameter trajectories are extremely low-rank — so low-rank, in fact, that a single direction captures most of the performance gain.

The team tested their finding on three Qwen models: Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base. They discovered that the magnitude of the rank-1 projection evolves near-linearly with training steps, meaning you can observe a short window of RLVR updates, fit a line, and extrapolate future checkpoints without running the rest of the training loop. The method they propose — RELEX (REinforcement Learning EXtrapolation) — does exactly that: estimate the rank-1 subspace from a brief observation window, then use linear regression to predict checkpoints 10–20× further out.
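The observe, fit, and extrapolate loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `extrapolate_rank1`, the flattened parameter vectors, the uncentered SVD, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def extrapolate_rank1(theta0, observed, steps, target_step):
    """Predict a future checkpoint along the dominant update direction.

    theta0:      (d,) flattened base-model parameters
    observed:    (k, d) flattened checkpoints from the observation window
    steps:       (k,) training step of each observed checkpoint
    target_step: step to extrapolate to
    """
    theta0 = np.asarray(theta0, dtype=float)
    deltas = np.asarray(observed, dtype=float) - theta0   # updates relative to the base
    # Top right singular vector = dominant (rank-1) direction of the updates.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    u = vt[0]
    coeffs = deltas @ u                                   # scalar magnitude per checkpoint
    # Ordinary least squares: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=float), coeffs, deg=1)
    return theta0 + (slope * target_step + intercept) * u

# Synthetic demo (assumed data): updates grow linearly along one direction, plus small noise.
rng = np.random.default_rng(0)
d = 64
theta0 = rng.standard_normal(d)
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
steps = np.arange(10, 60, 10)                             # observe steps 10..50
observed = np.stack([
    theta0 + 0.01 * s * direction + 1e-4 * rng.standard_normal(d) for s in steps
])
predicted = extrapolate_rank1(theta0, observed, steps, target_step=1000)
target = theta0 + 0.01 * 1000 * direction
print("relative error:", np.linalg.norm(predicted - target) / np.linalg.norm(target - theta0))
```

The sign of the singular vector is arbitrary, but it cancels out because the fitted coefficients flip with it. A real model would need per-layer handling and far more memory-aware code than this toy version.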

On benchmarks

RELEX matches or exceeds full RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the training steps. In one experiment, the authors observed only the first 50 steps and successfully extrapolated to step 1000 with continued improvement. Ablation studies confirm that neither raising the subspace rank nor switching to non-linear models yields further gains — the rank-1 linear path is sufficient.

The authors attribute RELEX's success to a "denoising" effect: projecting updates onto the rank-1 subspace filters out stochastic optimization noise that would otherwise degrade extrapolated checkpoints. The preprint and code are both public, and the finding suggests that RLVR training may be far more predictable — and far cheaper — than current practice assumes.
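A toy example can make the "denoising" intuition concrete. The sketch below assumes a simple model that is not taken from the preprint: a true update lying exactly along a known unit direction `u`, corrupted by isotropic Gaussian noise. The helper `project_rank1` and all constants are hypothetical.

```python
import numpy as np

def project_rank1(delta, u):
    """Keep only the component of `delta` along the direction `u`."""
    u = u / np.linalg.norm(u)
    return (delta @ u) * u

rng = np.random.default_rng(0)
d = 10_000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

signal = 3.0 * u                               # assumed "true" update along u
noise = 0.01 * rng.standard_normal(d)          # assumed isotropic optimization noise
noisy = signal + noise

raw_err = np.linalg.norm(noisy - signal)                      # error before projection
proj_err = np.linalg.norm(project_rank1(noisy, u) - signal)   # error after projection
print("raw error:", raw_err, "projected error:", proj_err)
```

Under this model the projection discards every noise component orthogonal to `u`, so only a one-dimensional slice of the noise survives. Whether real RLVR noise behaves this way is the preprint's claim to support, not something this sketch demonstrates.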
