Mid-training on self-generated reasoning paths boosts RL gains across math and code | UncensoredHub

New research shows that fine-tuning language models on multiple self-generated solution paths before reinforcement learning improves performance on math benchmarks and out-of-distribution tasks.

May 18, 2026

A preprint posted this week reports that language models fine-tuned on diverse self-generated reasoning data before reinforcement learning (RL) outperform models that skip that intermediate step. The intermediate step, called mid-training, relies on a bootstrapped data-generation framework, inspired by George Polya's problem-solving strategies, that diversifies the training data.

For each training question, the model generates multiple correct solutions that rely on different reasoning approaches, such as algebraic manipulation, visual reasoning, or working backward from the solution. The model is then fine-tuned on this expanded dataset before RL training begins. Where a fixed dataset offers one reference solution per problem, the expanded dataset shows the model several reasoning strategies for the same problem.
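The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the preprint's implementation: the function names, the strategy labels (taken from the article's examples), the `generate` callback, and the exact-match correctness check against a reference answer are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical strategy labels, borrowed from the article's examples.
STRATEGIES = ("algebraic manipulation", "visual reasoning", "working backward")


def build_mid_training_examples(
    question: str,
    reference_answer: str,
    generate: Callable[[str, str], Dict[str, str]],
    strategies: Sequence[str] = STRATEGIES,
) -> List[Dict[str, str]]:
    """Collect at most one correct solution per strategy for one question.

    `generate(question, strategy)` is an assumed stand-in for a model call
    that returns {"reasoning": ..., "answer": ...}.
    """
    examples = []
    for strategy in strategies:
        candidate = generate(question, strategy)
        # Assumption: correctness means the final answer matches the reference.
        if candidate["answer"].strip() == reference_answer.strip():
            examples.append(
                {
                    "question": question,
                    "strategy": strategy,
                    "reasoning": candidate["reasoning"],
                }
            )
    return examples


def toy_generate(question: str, strategy: str) -> Dict[str, str]:
    """Toy generator: pretends the 'visual reasoning' attempt gets it wrong."""
    answer = "5" if strategy == "visual reasoning" else "4"
    return {"reasoning": f"[{strategy}] solution for: {question}", "answer": answer}


examples = build_mid_training_examples("What is 2 + 2?", "4", toy_generate)
kept_strategies = [example["strategy"] for example in examples]
```

With the toy generator, only the two strategies that reach the reference answer are kept. In a real pipeline the retained examples would form the expanded fine-tuning set, and the correctness check would likely need to be more tolerant than a string comparison.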

What stands out

01. Theory explains why diversity helps RL. The authors provide a theoretical argument that policy-gradient updates during RL can incentivize combining multiple approaches when the model has already seen them during mid-training. Models that only see one solution path per problem during supervised fine-tuning may not explore alternative strategies during RL.
02. Gains hold across mathematical benchmarks and beyond. The paper reports consistent improvements on GSM8K, MATH, and other mathematical reasoning datasets. The method also transfers to out-of-distribution tasks: code generation on HumanEval and narrative reasoning on a separate benchmark both show gains over baselines that skip mid-training.
03. The framework is model-agnostic. The experiments use existing open-weight models as base checkpoints. The mid-training step does not require architectural changes or new RL algorithms; it slots in before standard policy-gradient training.
04. Polya's heuristics structure the generation process. The self-generation process is explicitly guided by Polya's four-step problem-solving framework: understanding the problem, devising a plan, carrying out the plan, and looking back. This structure increases solution diversity without requiring manual annotation.
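The first point above, that RL may not explore strategies the model has never seen, is consistent with a standard property of softmax policy gradients. The toy below is not the authors' argument, only an illustration under simplifying assumptions: a policy that picks one of two strategies, with fixed expected rewards per strategy. For a softmax policy, the gradient of expected reward with respect to the logit of strategy a is p_a (r_a - J), where J is the expected reward, so a strategy with near-zero probability receives almost no update even when it is the better one. All names and numbers are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_policy_gradient(logits: List[float], rewards: List[float]) -> List[float]:
    """Exact gradient of expected reward w.r.t. each logit: p_a * (r_a - J)."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


# Strategy 1 is the better one (hypothetical expected rewards).
rewards = [0.5, 1.0]

# "Seen" policy: both strategies already have comparable probability.
grad_seen = expected_policy_gradient([0.0, 0.0], rewards)

# "Unseen" policy: the better strategy starts with near-zero probability.
grad_unseen = expected_policy_gradient([0.0, -10.0], rewards)

# How much stronger the update toward the better strategy is when it was seen.
ratio = grad_seen[1] / grad_unseen[1]
```

In this toy, the update toward the better strategy is 0.125 for the "seen" policy and several thousand times smaller for the "unseen" one, which is one way to read the claim that exposure before RL gives the policy gradient something to amplify.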
