Hard reasoning examples resist learning in RLVR, even with correct solutions

New research finds that a subset of difficult examples remains unlearnable in reinforcement learning with verifiable reward, even when correct solutions are present, due to low gradient similarity and ungeneralizable patterns.

May 18, 2026

Reinforcement learning with verifiable reward (RLVR) can sharpen a language model's reasoning, but a new preprint reveals a stubborn limitation: some hard examples are never learned, even when the training data includes correct rollouts.

Yulin Chen, He He, and Chen Zhao document the phenomenon in a paper released this week. The team trained models on reasoning tasks using RLVR, the approach that scores outputs with an automatic verifier rather than a learned reward model. Among examples the model initially failed, they found a substantial fraction stayed stuck throughout training. Changing the optimizer, the sampling strategy, or the batch composition did not help.

Cross-example gradient analysis pointed to the cause: unlearnable examples produce gradients with low cosine similarity to the rest of the batch, signaling that the model's internal representation of those problems is fundamentally misaligned. The reasoning patterns in those examples also fail to generalize: what the model learns from one unlearnable case does not transfer to similar problems. Data augmentation, a standard fix for representation issues in supervised learning, did not improve gradient similarity in the RL setting, suggesting the flaw runs deeper than sample diversity.
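To make the gradient-similarity idea concrete, here is a minimal sketch of one way such a score could be computed. It is not taken from the paper or its repository: the function names, the leave-one-out averaging, and the toy gradient values are all assumptions chosen for illustration. Each example's flattened gradient is compared with the mean gradient of the other examples in the batch, and a score near zero means that example pulls the parameters in a direction the rest of the batch does not share.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def leave_one_out_similarity(grads):
    """Hypothetical helper: for each per-example gradient, return its cosine
    similarity to the mean gradient of all *other* examples in the batch."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dim = len(grads[0])
    # Sum over the batch once, then subtract each example to get the rest.
    total = [sum(g[j] for g in grads) for j in range(dim)]
    scores = []
    for g in grads:
        rest_mean = [(total[j] - g[j]) / (n - 1) for j in range(dim)]
        scores.append(cosine(g, rest_mean))
    return scores


# Toy flattened per-example gradients (made-up numbers, not real data).
# The first three point in roughly the same direction; the last does not.
toy_grads = [
    [1.0, 0.9, 0.1],
    [0.8, 1.1, 0.0],
    [0.9, 1.0, 0.2],
    [-0.1, 0.05, 1.0],
]

scores = leave_one_out_similarity(toy_grads)
for i, s in enumerate(scores):
    print(f"example {i}: similarity to rest of batch = {s:.3f}")
```

In this toy batch the fourth example scores far lower than the other three, which is the kind of signature the article describes for unlearnable examples. In a real pipeline the vectors would be per-example policy gradients, and the threshold for "low" similarity would have to be chosen empirically.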

The paper characterizes unlearnability as a representation problem rather than an optimization or coverage problem. The authors released code and data at https://github.com/yulinchen99/unlearnability-rlvr for practitioners who want to audit their own RLVR pipelines. The work is the first systematic look at which examples resist learning in verifiable-reward RL, and it suggests current methods may need architectural or algorithmic changes to handle the full distribution of reasoning tasks.
