ContextRL trains LLMs to ground reasoning in supporting evidence
A new RL method rewards models for choosing the right context from similar alternatives, gaining an average of 2.2 percentage points on coding-agent benchmarks and 1.8 points across visual QA tasks. Adding the same data without the selection objective does not reproduce the gains.

ContextRL is a reinforcement learning technique from researchers at Princeton, Michigan State, and the University of Illinois that teaches large language models to ground answers in specific pieces of evidence by making them choose between highly similar contexts. The method addresses a known failure mode: models that produce plausible answers without anchoring them to the decisive detail in a long tool trace or a subtle visual cue in an image.
Instead of rewarding only the final answer, ContextRL presents the model with a query, an answer, and two nearly identical contexts, then rewards it for selecting the context that actually supports the query–answer pair. The team built 1,000 contrastive trajectory pairs for coding agents (via condition filtering) and 7,000 contrastive image pairs for multimodal reasoning (via generative editing and similarity search). Training with this auxiliary objective improved performance over standard GRPO by an average of 2.2 percentage points on five long-horizon coding benchmarks and 1.8 points across twelve visual question-answering benchmarks.
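To make the selection task concrete, here is a minimal Python sketch of how such a reward could be wired up. It is an illustration under stated assumptions, not the authors' implementation: the names `ContrastivePair`, `build_selection_prompt`, `selection_reward`, and `group_advantages` are hypothetical, as are the prompt wording and the 0/1 reward. The group-relative normalization is the standard GRPO form; how ContextRL combines this signal with the answer reward is not described here.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one contrastive training example."""
    query: str
    answer: str
    supporting: str  # context that supports the (query, answer) pair
    distractor: str  # near-identical context that does not


def build_selection_prompt(pair: ContrastivePair, rng: random.Random):
    """Shuffle the two contexts so position carries no signal.

    Returns the prompt text and the gold label ("A" or "B").
    The prompt wording is an assumption made for illustration.
    """
    contexts = [("supporting", pair.supporting), ("distractor", pair.distractor)]
    rng.shuffle(contexts)
    gold = "AB"[[name for name, _ in contexts].index("supporting")]
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A: {contexts[0][1]}\n"
        f"Context B: {contexts[1][1]}\n"
        "Which context supports the answer? Reply with A or B."
    )
    return prompt, gold


def selection_reward(model_choice: str, gold: str) -> float:
    """Assumed binary reward: 1.0 for picking the supporting context, else 0.0."""
    return 1.0 if model_choice.strip().upper() == gold else 0.0


def group_advantages(rewards, eps: float = 1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, each sampled response to the selection prompt would be scored with `selection_reward`, and the scores for one group of samples passed to `group_advantages` before the policy update.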
What stands out
- The gain comes from the objective, not the data volume. The authors ran a control experiment: they took the same contrastive contexts and repurposed them as standard query–context–answer triples (no selection task). Those data-augmentation baselines delivered little to no improvement, isolating the effect of the context-selection reward.
- Two construction pipelines for two domains. For coding agents, trajectories are the contexts; the team filtered execution traces by success conditions to create minimal pairs. For vision, images are the contexts; the team used generative editing (changing one object or attribute) and similarity search to produce near-duplicate frames that differ on the detail that matters.
- Indirect supervision scales where direct supervision is expensive. Annotating which sentence in a 10,000-token tool trace supports an answer is prohibitively slow. Teaching the model to distinguish supporting from non-supporting contexts sidesteps that bottleneck and transfers to held-out reasoning tasks.
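
The control experiment in the first point is easiest to see as two ways of packaging the same pair of contexts. The Python sketch below is a hedged illustration only: `ContextExample`, `as_augmentation_triples`, and `as_selection_example` are hypothetical names, and it assumes each context in a pair comes with the answer it supports, which the summary above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextExample:
    """Hypothetical record: one context and the answer it supports."""
    query: str
    context: str  # a tool trajectory or an image reference
    answer: str   # assumed to be known for each context in the pair


def as_augmentation_triples(a: ContextExample, b: ContextExample):
    """Control setting: each context becomes an ordinary training triple.

    The model sees more query-context-answer data but is never asked
    to tell the two contexts apart.
    """
    return [
        {"input": f"{ex.context}\n\n{ex.query}", "target": ex.answer}
        for ex in (a, b)
    ]


def as_selection_example(a: ContextExample, b: ContextExample):
    """Selection setting: pick the context that supports a's answer.

    In practice the candidate order would be shuffled before prompting
    so that position does not leak the label.
    """
    return {
        "query": a.query,
        "answer": a.answer,
        "candidates": [a.context, b.context],
        "target_index": 0,
    }


original = ContextExample("What color is the mug?", "img_original.png", "red")
edited = ContextExample("What color is the mug?", "img_edited.png", "blue")
triples = as_augmentation_triples(original, edited)
selection = as_selection_example(original, edited)
```

Both functions consume exactly the same contexts; only the second forces a comparison between them, which is the difference the control experiment is designed to isolate.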



