Reasoning models lose solution diversity when fine-tuning skips decision points
New research identifies why specialized reasoning models show pass@k degradation despite higher pass@1 scores—fine-tuning data with too few branching decision points collapses solution diversity.
A preprint from researchers at Virginia Tech and George Mason University explains why reasoning-focused language models often show worse pass@k performance than their base models, even as pass@1 accuracy climbs. The authors trace this loss of coverage to fine-tuning data that underrepresents "forks in the road": decision points where multiple valid reasoning paths exist but the patterns that separate them are hard to distinguish.
The phenomenon matters because reasoning models have become a key focus for practitioners working with open-weight releases. Specialized fine-tuning procedures reliably push pass@1 scores higher on complex tasks, making them attractive for deployment. But the pass@k metric—which measures whether at least one correct answer appears in k attempts—tells a different story. Prior work has observed that these same models often generate fewer distinct correct solutions than their untuned base counterparts, a behavior the authors call "coverage shrinkage."
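To make the metric concrete, here is a minimal Python sketch of how pass@k is commonly estimated from n samples per problem, c of which are correct. It uses the standard unbiased combinatorial estimator; the article does not say which estimator the preprint uses, so treat that choice as an assumption. The per-problem counts are invented purely to show how pass@1 can rise while pass@k falls.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples (out of n=10) on two problems.
# The base model solves both occasionally; the tuned model solves one
# every time and the other never.
base_counts = [3, 3]
tuned_counts = [10, 0]

for name, counts in [("base", base_counts), ("tuned", tuned_counts)]:
    p1 = mean_pass_at_k(counts, n=10, k=1)
    p5 = mean_pass_at_k(counts, n=10, k=5)
    print(f"{name}: pass@1={p1:.3f} pass@5={p5:.3f}")
```

In this toy setup pass@1 rises from 0.3 to 0.5 after tuning, while pass@5 falls from about 0.917 to 0.5: the tuned model is more reliable on one problem but has lost all coverage of the other.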
The authors designed controlled experiments to isolate the cause. They built case studies around graph branching tasks and reasoning modes, tracking how models behaved during and after supervised fine-tuning. The pattern was consistent: when the training data contained few decision points (moments where several reasoning paths are valid and the cues for choosing among them are hard to read), the model converged on a narrower solution space. The more decision points the data contained, the better the model retained its pass@k performance.
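As a rough illustration of what a decision point could look like in a graph task, the sketch below counts "forks" along training paths in a small directed acyclic graph. The definition used here (a node with at least two successors that can still reach the target) and the toy graph are assumptions made for this example, not the preprint's actual setup.

```python
def can_reach(graph, node, target, memo):
    """True if target is reachable from node. Assumes the graph is acyclic."""
    if node == target:
        return True
    if node not in memo:
        memo[node] = any(
            can_reach(graph, nxt, target, memo) for nxt in graph.get(node, ())
        )
    return memo[node]


def fork_nodes(graph, target):
    """Nodes where two or more outgoing edges still lead to the target."""
    memo = {}
    forks = set()
    for node, successors in graph.items():
        viable = [s for s in successors if can_reach(graph, s, target, memo)]
        if len(viable) >= 2:
            forks.add(node)
    return forks


def forks_per_path(graph, target, paths):
    """Count how many fork nodes each training path passes through."""
    forks = fork_nodes(graph, target)
    return [sum(1 for node in path if node in forks) for path in paths]


# Hypothetical toy graph: "s" is the start node, "t" is the target.
graph = {
    "s": ["a", "b"],
    "a": ["c", "d"],
    "b": ["t"],
    "c": ["t"],
    "d": ["t"],
    "t": [],
}
training_paths = [["s", "b", "t"], ["s", "a", "c", "t"]]
print(forks_per_path(graph, "t", training_paths))
```

A count like this could be used to compare how often different training sets expose a model to branching choices, which is the property the experiments vary.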
Two mitigation strategies showed promise. Targeted data synthesis that deliberately includes more branching scenarios reduced shrinkage in the controlled settings. A diversity-encouraging decoding mechanism also helped, though both are partial fixes rather than complete solutions. The findings point to data-centric design as the primary lever for controlling coverage, which is relevant to anyone fine-tuning reasoning models from open-weight releases.
