PROPEL roughly doubles learnable task generation for code agents without per-candidate solver rollouts
A new solver-amortized training method shifts synthetic task generation toward the learnable frontier, raising the share of generated tasks that fall in the learnable band from about 10% to 20% for a Qwen2.5-3B-Instruct code solver without running the solver on every candidate.

Researchers have identified a critical bottleneck in reinforcement-learning agent training: most synthetically generated tasks are either trivial or unsolvable, and filtering for the narrow band of learnable challenges requires expensive solver rollouts to evaluate each candidate.
PROPEL, introduced in a June 2026 arXiv preprint, replaces that bottleneck with a lightweight activation probe trained once on labeled task-outcome pairs. The probe predicts whether a generated task will land in the target solve-rate window (hard enough to teach, easy enough to solve) by reading internal states from a frozen reference model. During generator optimization the probe acts as a fast proxy for actual solver evaluation, cutting the cost of each candidate from a solver run lasting tens of minutes to a single forward pass.
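
The probe idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the window bounds (0.2 to 0.8), the logistic-regression probe, the helper names, and the synthetic "activations" are all assumptions made for illustration. In a real setup the feature vectors would come from a frozen reference model and the solve rates from a one-time labeling pass.

```python
import math
import random


def in_window(solve_rate, low=0.2, high=0.8):
    """Label a task 1.0 if its measured solve rate is in the target window.
    The bounds are illustrative assumptions, not values from the paper."""
    return 1.0 if low <= solve_rate <= high else 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(x, w, b):
    """Predicted probability that a task with activation vector x is learnable."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(acts, labels, lr=0.5, steps=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = probe_score(x, w, b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_toy_data(n, dim, rng):
    """Synthetic stand-in for (activation, solve-rate) pairs.
    Toy assumption: the first coordinate decides whether a task is
    learnable (solve rate 0.5) or unsolvable (solve rate 0.0)."""
    acts, rates = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        acts.append(x)
        rates.append(0.5 if x[0] > 0 else 0.0)
    return acts, rates


def demo(seed=0):
    """Train on one synthetic split and return held-out probe accuracy."""
    rng = random.Random(seed)
    train_x, train_r = make_toy_data(300, 8, rng)
    test_x, test_r = make_toy_data(200, 8, rng)
    w, b = train_probe(train_x, [in_window(r) for r in train_r])
    hits = sum(
        (probe_score(x, w, b) > 0.5) == (in_window(r) == 1.0)
        for x, r in zip(test_x, test_r)
    )
    return hits / len(test_x)


if __name__ == "__main__":
    print(f"held-out probe accuracy: {demo():.2f}")
```

Once trained, scoring a new candidate costs only one dot product on top of the reference model's forward pass, which is the property that lets a probe stand in for a full solver evaluation during generator optimization.
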
Across math, code, and software-engineering benchmarks the method roughly doubles the share of generations that fall in the learnable band. For a Qwen2.5-3B-Instruct code solver, tasks at the targeted difficulty rose from 10.1% to 20.0%; for Qwen2.5-7B-Instruct the share climbed from 5.3% to 12.6%. On software-engineering repositories unseen during training, PROPEL lifted the learnable fraction from 9.8% to 19.6% for Qwen3.5-27B.
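
To make "roughly doubles" concrete, the relative gains can be computed directly from the percentages reported above. The short script below uses only those figures; the dictionary labels and the function name are illustrative.

```python
# Learnable-band share (percent) before and after PROPEL, as quoted above.
results = {
    "Qwen2.5-3B-Instruct (code)": (10.1, 20.0),
    "Qwen2.5-7B-Instruct": (5.3, 12.6),
    "Qwen3.5-27B (unseen software-engineering repos)": (9.8, 19.6),
}


def relative_gain(before, after):
    """Ratio of the learnable share after PROPEL to the share before."""
    return after / before


if __name__ == "__main__":
    for name, (before, after) in results.items():
        gain = relative_gain(before, after)
        print(f"{name}: {before}% -> {after}% ({gain:.2f}x)")
```

The ratios come out to about 1.98x, 2.38x, and 2.00x, so the 7B solver sees a somewhat larger relative gain than the other two settings.
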
The paper frames the result as a shift from solver-in-the-loop generation, where every candidate triggers a full agent run, to solver-amortized generation, where a one-time labeling pass trains a predictor that guides all subsequent task creation. As frontier models improve, fixed task distributions saturate quickly; PROPEL offers a path to scale task supply in step with solver capability without multiplying compute costs by the number of candidates evaluated.
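
The cost argument can be written as a simple model. The sketch below is an assumption-laden illustration, not the paper's accounting: it treats each solver evaluation as one fixed-cost rollout, treats each probe call as one fixed-cost forward pass, and ignores probe training time. All function names and the numbers in the usage example are hypothetical.

```python
def in_loop_cost(n_candidates, rollout_cost):
    """Solver-in-the-loop: every candidate pays for a full solver rollout."""
    return n_candidates * rollout_cost


def amortized_cost(n_candidates, n_labeled, rollout_cost, probe_cost):
    """Solver-amortized: rollouts only for the one-time labeling set,
    then one cheap probe evaluation per candidate."""
    return n_labeled * rollout_cost + n_candidates * probe_cost


def break_even_candidates(n_labeled, rollout_cost, probe_cost):
    """Number of candidates beyond which the amortized scheme is cheaper."""
    if probe_cost >= rollout_cost:
        raise ValueError("probe must be cheaper than a rollout to break even")
    return n_labeled * rollout_cost / (rollout_cost - probe_cost)


if __name__ == "__main__":
    # Hypothetical values: 20 minutes per rollout, 0.01 minutes per probe
    # call, 1,000 labeled tasks, 5,000 candidates scored during optimization.
    print(in_loop_cost(5000, 20.0))
    print(amortized_cost(5000, 1000, 20.0, 0.01))
    print(break_even_candidates(1000, 20.0, 0.01))
```

Under these assumptions the in-loop cost grows linearly with the number of candidates at the rollout price, while the amortized cost is dominated by the fixed labeling pass, so the break-even point sits just above the size of the labeling set.
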



