FEST cuts LLM reasoning training data needs to 128 demonstrations
New few-shot RLVR method matches full-dataset performance with 99% less supervised fine-tuning data on math and coding benchmarks.

FEST is a few-shot reinforcement learning algorithm from researchers at the University of Illinois Urbana-Champaign that trains large language models for chain-of-thought reasoning using only 128 randomly selected demonstrations. The method targets a core bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR): on difficult math and coding problems where correct rollouts are rare, standard RLVR burns through compute without improving, and prior demonstration-guided approaches need thousands of supervised fine-tuning examples to restore a learning signal.
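The failure mode is concrete. The sketch below illustrates why sparse verifiable rewards stall training; it depicts the general RLVR setup rather than the paper's code, and the verifier, group size, and baseline scheme are all assumptions for illustration:

```python
import numpy as np

def verifiable_reward(rollout_answer: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the rollout's final answer passes
    the check (a stand-in for exact match or unit tests), else 0.0."""
    return 1.0 if rollout_answer.strip() == reference_answer.strip() else 0.0

# On a hard problem, every sampled rollout can be wrong:
rewards = np.array([verifiable_reward("41", "42") for _ in range(8)])

# With a group-mean baseline (as in GRPO-style methods), the advantages
# are then identically zero, so the policy-gradient update is zero:
# compute is spent on rollouts, but the model does not improve.
advantages = rewards - rewards.mean()
print(advantages)  # [0. 0. 0. 0. 0. 0. 0. 0.]
```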
The paper, posted to arXiv this week by Kai Yan, Alexander Schwing, and Yu-Xiong Wang, identifies three components that make the few-shot approach work: a supervised signal from the small demonstration set, an on-policy RL signal that keeps the model exploring, and decaying weights on the SFT data across training epochs to prevent overfitting. On several benchmarks the team tested, FEST matched the performance of baselines trained on full SFT datasets—often tens of thousands of examples—while using orders of magnitude less labeled data.
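The paper's exact objective isn't quoted here, but the three components map naturally onto a weighted sum of losses. A minimal sketch, assuming a per-epoch exponential decay on the supervised term; the function name, decay schedule, and default coefficients are illustrative assumptions, not values from the paper:

```python
import torch

def combined_loss(sft_nll: torch.Tensor,
                  rl_loss: torch.Tensor,
                  epoch: int,
                  sft_weight: float = 1.0,
                  decay: float = 0.5) -> torch.Tensor:
    """Blend the supervised and on-policy training signals.

    sft_nll -- negative log-likelihood on the 128 demonstrations
               (the supervised signal).
    rl_loss -- on-policy policy-gradient loss from fresh rollouts scored
               by the verifiable reward (keeps the model exploring).
    epoch   -- as training progresses, the supervised weight decays so
               the model does not overfit the tiny demonstration set.
    """
    lam = sft_weight * (decay ** epoch)  # illustrative decay schedule
    return lam * sft_nll + rl_loss
```

Early in training the demonstrations steer the policy toward regions where correct rollouts exist at all; once those rollouts start appearing, the on-policy term takes over as the supervised weight fades.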
The 128-demonstration budget is not a tuned hyperparameter, and the demonstrations themselves are a random sample from existing SFT collections rather than a curated set. The authors report that even at that scale, the combination of supervised and on-policy signals was enough to guide the model past the sample-efficiency wall that stalls pure RLVR on hard problems.
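Reproducing the selection step itself is deliberately simple. Assuming the SFT collection is a JSONL file of prompt/response records (the file name and schema below are placeholders, not from the paper), the whole procedure is one uniform draw:

```python
import json
import random

# Load an existing SFT collection; file name and record layout are
# placeholders for whatever dataset is being subsampled.
with open("sft_collection.jsonl") as f:
    sft_collection = [json.loads(line) for line in f]

# The entire data-selection step: a uniform random sample of 128 records.
few_shot_demos = random.sample(sft_collection, k=128)
```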