Entrocraft rejection sampling pushes 4B LLM past 8B baseline in RL training
A new preprint introduces Entrocraft, a rejection-sampling method that controls entropy curves during reinforcement learning, letting a 4-billion-parameter model outperform an 8-billion-parameter baseline and sustain gains four times longer.

Reinforcement learning for large language models hits a wall when entropy collapses — the moment exploration stops and training plateaus. A preprint posted to arXiv this week argues that existing fixes, which rely on regularization or clipping, produce unstable entropy curves that still choke off long-term gains.
Entrocraft, from researchers at Purdue and the University of Maryland, takes a different path: rejection sampling that biases advantage distributions to follow a user-defined entropy schedule. The method requires no objective regularization and works with any advantage estimator. The authors relate per-step entropy change to the advantage distribution under minimal assumptions, explaining why prior RL and entropy-preserving methods behave the way they do. A systematic study of entropy schedules finds that linear annealing — starting high and decaying to a slightly lower target — performs best.
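The mechanism is easiest to see in miniature. The sketch below is not the authors' algorithm; it is a rough Python illustration, under stated assumptions, of the general idea the paper describes: a linear entropy schedule paired with a rejection step that keeps or drops sampled responses depending on how their advantages would push entropy relative to the scheduled target. The function names, numeric endpoints, and the acceptance rule itself are all hypothetical.

```python
import math
import random

def entropy_target(step, total_steps, h_start=2.0, h_end=1.6):
    """Hypothetical linear annealing schedule: start high and decay to a
    slightly lower target, echoing the schedule the paper reports works
    best. The numeric endpoints here are placeholders."""
    frac = min(step / max(total_steps, 1), 1.0)
    return h_start + frac * (h_end - h_start)

def rejection_filter(logprobs, advantages, current_entropy, target, temp=1.0):
    """Hypothetical rejection step over one batch of sampled responses.

    Per-step entropy change under policy gradients roughly tracks
    -Cov(log pi, A): reinforcing already-likely responses drives entropy
    down. When entropy sits below the scheduled target, this sketch
    preferentially drops the samples most responsible for that downward
    pressure, and vice versa. It illustrates the general idea only; it is
    not the paper's acceptance criterion."""
    n = len(logprobs)
    lp_mean = sum(logprobs) / n
    adv_mean = sum(advantages) / n
    gap = target - current_entropy  # > 0 means we want entropy to rise
    kept = []
    for i in range(n):
        # Positive contribution => this sample tends to push entropy up.
        contribution = -(logprobs[i] - lp_mean) * (advantages[i] - adv_mean)
        prob = 1.0 / (1.0 + math.exp(-gap * contribution / temp))
        if random.random() < prob:
            kept.append(i)
    return kept

# Toy usage on a fake batch of per-response log-probabilities and advantages.
logprobs = [-0.2, -3.0, -0.1, -2.5]
advantages = [1.0, 0.5, -0.8, -0.3]
target = entropy_target(step=100, total_steps=1000)
kept = rejection_filter(logprobs, advantages, current_entropy=1.4, target=target)
print(f"target entropy {target:.2f}, kept indices {kept}")
```

Because the filtering happens on the sampled batch rather than in the loss, a sketch like this slots in front of any advantage estimator and leaves the training objective untouched, which is the property the paper emphasizes.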
In empirical tests, Entrocraft lets a 4-billion-parameter model outperform an 8-billion-parameter baseline, sustains improvement for up to four times longer before plateauing, and raises pass@K by 50 percent over the baseline. The paper is available on arXiv (2604.26326). Authors are Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, and Ruqi Zhang.