ReCrit framework lifts critic-stage accuracy by up to 13 points on scientific reasoning tasks
A new reinforcement learning approach treats AI critic interaction as a turn-by-turn correctness problem, rewarding models that hold correct answers under user pushback while fixing genuine errors.
Large language models often flip from a correct scientific answer to an incorrect one the moment a user questions them, a behavior researchers call "sycophancy." ReCrit, a transition-aware reinforcement learning framework published on arXiv this week, addresses that problem by rewarding models that distinguish useful correction from harmful over-compliance.
The framework frames critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem. Instead of optimizing only for whether the final answer is right, ReCrit tracks what happens between turns—whether a model abandons a correct solution under pressure or successfully fixes a genuine mistake. The framework decomposes behavior into four quadrants: Correction (wrong-to-right transitions), Sycophancy (right-to-wrong), Robustness (right-to-right), and Boundary (wrong-to-wrong). ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals that provide minimal training gradient.
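The quadrant scheme can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names and the numeric weights in QUADRANT_REWARD are hypothetical, chosen only to respect the ordering the article describes (reward correction and robustness, penalize sycophancy, keep the boundary signal weak). The sign and size of the boundary weight in particular are assumptions.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "wrong-to-right"
    SYCOPHANCY = "right-to-wrong"
    ROBUSTNESS = "right-to-right"
    BOUNDARY = "wrong-to-wrong"


def classify_transition(first_correct: bool, second_correct: bool) -> Quadrant:
    """Map correctness before and after the critic turn to a quadrant."""
    if first_correct:
        return Quadrant.ROBUSTNESS if second_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if second_correct else Quadrant.BOUNDARY


# Hypothetical weights: illustrative values, not taken from the paper.
QUADRANT_REWARD = {
    Quadrant.CORRECTION: 1.0,   # fixed a genuine mistake
    Quadrant.ROBUSTNESS: 1.0,   # held a correct answer under pushback
    Quadrant.SYCOPHANCY: -1.0,  # abandoned a correct answer
    Quadrant.BOUNDARY: -0.1,    # persistent error, deliberately weak signal
}


def transition_reward(first_correct: bool, second_correct: bool) -> float:
    """Transition-aware reward: depends on both turns."""
    return QUADRANT_REWARD[classify_transition(first_correct, second_correct)]


def final_answer_reward(second_correct: bool) -> float:
    """Baseline for comparison: looks only at the last answer."""
    return 1.0 if second_correct else 0.0


if __name__ == "__main__":
    for first in (True, False):
        for second in (True, False):
            quadrant = classify_transition(first, second)
            print(quadrant.name,
                  transition_reward(first, second),
                  final_answer_reward(second))
```

Under these assumed weights, the final-answer baseline gives sycophancy and boundary cases the same reward because both end in a wrong answer, while the transition-aware reward separates them. That is the distinction the framework is built around.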
This approach is especially relevant for scientific reasoning, where user criticism can turn a valid answer into an incorrect one. A model that caves to uninformed pushback may be polite but unreliable. ReCrit's quadrant structure gives the training loop explicit signals about which behaviors to amplify and which to suppress, rather than collapsing everything into a single accuracy metric.
To make multi-turn interaction training practical at scale, ReCrit uses dynamic asynchronous rollout with tail-adaptive completion—a design choice that cuts rollout waiting time, a common bottleneck when generating multi-turn dialogues for reinforcement learning.
On three scientific reasoning benchmarks (ChemBench, TRQA, and EarthSE), ReCrit improved average critic-stage accuracy from 38.15% to 51.49% on Qwen3.5-4B and from 45.40% to 55.59% on Qwen3.5-9B. Those gains, 13.34 percentage points on the 4B model and 10.19 on the 9B model, mean both models more often hold or appropriately correct their answers after user feedback. Ablations in the paper show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net improvement at the critic stage.
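For readers who want to check the arithmetic, the short sketch below recomputes the gains from the reported averages. The accuracy figures come from the article; the dictionary layout and the gains helper are illustrative, and the relative-improvement figure is derived here rather than reported by the paper.

```python
# Average critic-stage accuracy (%) before and after ReCrit, as reported.
reported = {
    "Qwen3.5-4B": (38.15, 51.49),
    "Qwen3.5-9B": (45.40, 55.59),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for model, (before, after) in reported.items():
        absolute, relative = gains(before, after)
        print(f"{model}: +{absolute} points ({relative}% relative)")
```

The absolute gain is the figure quoted in the text. The relative gain is larger for the 4B model because it starts from a lower baseline.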
Code is available on GitHub for practitioners who want to apply the framework to other reasoning domains or model families.
