Sparse-to-dense reward principle lifts Qwen3-1.7B MATH accuracy to 78.5%
New preprint argues scarce labeled data should train a large upstream model with sparse RL, with the result then transferred downstream as dense supervision, lifting Qwen3-1.7B MATH accuracy to 78.5% where direct GRPO baselines fall short.

A preprint released this week challenges the standard practice of running sparse reinforcement learning directly on the small model you plan to deploy. Researchers at Alibaba and Amazon propose a sparse-to-dense reward principle: use scarce labeled verifiable data upstream on a large teacher model, where exploration is productive, then compress that behavior downstream via dense token-level supervision. Testing on math reasoning with Qwen3 and Llama models, they found that a 1.7B student distilled from an RL-improved 8B teacher through a dense bridge consistently outperforms the same student trained directly with GRPO.
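Concretely, the recipe reads as a three-stage allocation. The sketch below is a paraphrase, not the paper's code: `train_grpo`, `distill_forward_kl`, and `distill_on_policy` are hypothetical stand-ins for whatever GRPO and distillation implementations one has on hand, and whether the bridge reuses the same prompts as the labeled set is left open here.

```python
# Hypothetical sketch of the sparse-to-dense allocation; function names
# are illustrative stand-ins, not APIs from the paper or any library.

def sparse_to_dense(teacher, student, labeled_data, prompts):
    # Stage 1: spend the scarce verifiable labels on the large teacher,
    # where sparse sequence-level reward still discovers correct solutions.
    teacher = train_grpo(teacher, labeled_data)

    # Stage 2, the dense bridge: compress the reward-shaped teacher into
    # the student with token-level supervision. Distillation needs only
    # prompts, not verifier labels: forward-KL warmup on teacher rollouts,
    # then on-policy distillation on student rollouts.
    student = distill_forward_kl(student, teacher, prompts)
    student = distill_on_policy(student, teacher, prompts)

    # Stage 3: only now run sparse RL on the student itself.
    return train_grpo(student, labeled_data)
```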
The core insight is that GRPO-style sparse sequence-level reward and on-policy-distillation-style dense teacher supervision are not separate recipes but two ends of a single reward-density spectrum. The allocation rule that falls out is simple: train the strongest available model on the labeled data using sparse reward, then transfer that reward-shaped behavior downstream as dense supervision.

At a fixed Qwen3-1.7B deployment size, a forward-KL warmup on teacher rollouts followed by on-policy distillation on student rollouts was consistently the strongest bridge on MATH, before any post-bridge student-side sparse RL. The bridge also made that later student-side RL effective: GRPO, which is weak on a cold student, lifted MATH accuracy from 75.4% to 78.5% after the bridge and beat a matched replay control by 2.8 points. Distilling from the same teacher before its RL stage underperformed, suggesting the bridge is not just a warmup but a mechanism that makes scarce labeled data more productive.
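The density contrast is easiest to see in the loss shapes. Here is a minimal PyTorch sketch, with GRPO stripped of its importance ratios, clipping, and KL penalty down to its group-baseline core, and with forward KL standing in for both distillation phases (which divergence the on-policy phase actually uses is an assumption on our part):

```python
import torch
import torch.nn.functional as F

def sparse_grpo_loss(logprobs, rewards):
    """Sparse regime: one verifiable 0/1 reward per sampled sequence.
    logprobs: (G, T) per-token log-probs for a group of G rollouts.
    rewards:  (G,)  sequence-level rewards from the verifier.
    Every token in a rollout inherits a single group-normalized advantage;
    full GRPO adds importance ratios, clipping, and a KL term on top.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # (G,)
    return -(adv.unsqueeze(1) * logprobs).sum(dim=1).mean()

def dense_distill_loss(student_logits, teacher_logits):
    """Dense regime: the teacher scores every token position, so each
    token carries its own supervision signal. Forward KL(teacher || student),
    applied to teacher rollouts in the warmup and to student rollouts in
    the on-policy phase. logits: (G, T, V).
    """
    log_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_s, p_t, reduction="batchmean")
```

The asymmetry is the whole point: the sparse loss carries one bit of signal per rollout, while the dense loss carries a full distribution per token.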
The operational principle is to avoid spending that data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge. The authors also report that the bridge yields the best AIME endpoints, measured before the final student-side RL stage, for the canonical 8B and 14B teachers. Whether the sparse-to-dense allocation rule stays optimal for 70B+ students, or for code and reasoning tasks with less clean ground truth, is an open question.
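A toy computation (ours, not the paper's) makes the "least prepared policy" point concrete: under a verifiable 0/1 reward with group normalization, a cold student whose rollouts all fail produces identically zero advantages, so the labeled problem is consumed without yielding any gradient signal, while the same label drives learning on a teacher that sometimes succeeds.

```python
import torch

# Group-normalized advantages for a group of 4 rollouts on one problem.
rewards = {
    "cold 1.7B student": torch.tensor([0., 0., 0., 0.]),   # no rollout correct
    "capable 8B teacher": torch.tensor([1., 0., 1., 0.]),  # mixed outcomes
}
for name, r in rewards.items():
    adv = (r - r.mean()) / (r.std() + 1e-6)
    print(f"{name}: advantages = {adv.tolist()}")
# cold student -> [0, 0, 0, 0]: the scarce label produces zero learning signal
# 8B teacher   -> nonzero advantages: the same label yields a usable gradient
```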