PROPEL doubles learnable task generation for code agents without solver rollouts

A new solver-amortized training method shifts synthetic task generation toward the learnable frontier, raising the share of generated tasks in the learnable band from about 10% to 20% for a code solver without repeated solver rollouts.

By Alex Sokoloff · June 19, 2026


Researchers have identified a critical bottleneck in reinforcement-learning agent training: most synthetically generated tasks are either trivial or unsolvable, and filtering for the narrow band of learnable challenges requires expensive solver rollouts to evaluate each candidate.

PROPEL, introduced in a June 2026 arXiv preprint, removes that bottleneck with a lightweight activation probe trained once on labeled task-outcome pairs. The probe reads internal states from a frozen reference model and predicts whether a generated task will land in the target solve-rate window: hard enough to teach, easy enough to solve. During generator optimization, the probe acts as a fast proxy for actual solver evaluation, cutting the cost of scoring each candidate from tens of minutes to a single forward pass.
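To make the probe idea concrete, here is a minimal sketch of a linear probe trained on labeled task-outcome pairs. It is not the paper's implementation: the window bounds (0.2 to 0.8), the 8-dimensional "activations", and every function name are assumptions for illustration, and the activations are synthetic vectors built to carry a linear signal, not states read from a real frozen model.

```python
import math
import random

# Hypothetical window bounds; the article does not state the actual thresholds.
LOW, HIGH = 0.2, 0.8
DIM = 8  # toy activation width, far smaller than a real hidden state


def in_learnable_band(solve_rate, low=LOW, high=HIGH):
    """Label a task 1 if its measured solve rate falls inside the window."""
    return 1 if low <= solve_rate <= high else 0


def synthetic_task(rng):
    """Return (activation, label) for one made-up task.

    The activation is Gaussian noise shifted by +1 or -1 per coordinate
    depending on the label, so a linear probe can recover it by construction.
    """
    label = in_learnable_band(rng.random())
    shift = 1.0 if label else -1.0
    activation = [rng.gauss(0.0, 1.0) + shift for _ in range(DIM)]
    return activation, label


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(weights, bias, activation):
    """Predicted probability that the task is in the learnable band."""
    return sigmoid(sum(w * a for w, a in zip(weights, activation)) + bias)


def train_probe(data, epochs=100, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for activation, label in data:
            err = probe_score(weights, bias, activation) - label
            for i, a in enumerate(activation):
                grad_w[i] += err * a
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of tasks whose thresholded score matches the label."""
    hits = sum((probe_score(weights, bias, a) >= 0.5) == bool(y) for a, y in data)
    return hits / len(data)


rng = random.Random(0)
train = [synthetic_task(rng) for _ in range(200)]      # one-time labeled set
held_out = [synthetic_task(rng) for _ in range(100)]   # unseen tasks
weights, bias = train_probe(train)
print(f"held-out accuracy: {accuracy(weights, bias, held_out):.2f}")
```

Once trained, scoring a new candidate is a single call to `probe_score`, which is the step that stands in for a full solver run in this sketch.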

Across math, code, and software-engineering benchmarks, the method roughly doubles the share of generations that fall in the learnable band. For a Qwen2.5-3B-Instruct code solver, the share of tasks at the targeted difficulty rose from 10.1% to 20.0%; for Qwen2.5-7B-Instruct the share climbed from 5.3% to 12.6%. On software-engineering repositories unseen during training, PROPEL lifted the learnable fraction from 9.8% to 19.6% for Qwen3.5-27B.
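The "roughly doubles" claim can be checked directly from the reported percentages. The short script below only divides the figures quoted above; the dictionary labels and the function name are illustrative.

```python
# Learnable-band share (percent) before and after PROPEL, as reported above.
results = {
    "Qwen2.5-3B-Instruct": (10.1, 20.0),
    "Qwen2.5-7B-Instruct": (5.3, 12.6),
    "Qwen3.5-27B": (9.8, 19.6),
}


def relative_gain(before, after):
    """Ratio of the learnable share after PROPEL to the share before it."""
    return after / before


for name, (before, after) in results.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before}% -> {after}% ({gain:.2f}x)")
```

The ratios come out to about 1.98x, 2.38x, and 2.00x respectively, so the 7B solver sees a somewhat larger relative gain than the other two.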

The paper frames the result as a shift from solver-in-the-loop generation, where every candidate triggers a full agent run, to solver-amortized generation, where a one-time labeling pass trains a predictor that guides all subsequent task creation. As frontier models improve, fixed task distributions saturate quickly; PROPEL offers a path to scale task supply in step with solver capability without multiplying compute costs by the number of candidates evaluated.
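A back-of-the-envelope comparison shows why amortization matters as the number of candidates grows. Every number below is a hypothetical placeholder (a 20-minute rollout, a 1-second probe pass, 1,000 labeled tasks, 100,000 candidates), not a figure from the paper, and the sketch ignores the cost of training the probe and of optimizing the generator.

```python
def solver_in_the_loop_minutes(num_candidates, rollout_minutes):
    """Every candidate task triggers a full solver run."""
    return num_candidates * rollout_minutes


def solver_amortized_minutes(num_candidates, num_labeled, rollout_minutes, probe_seconds):
    """A one-time labeling pass, then one probe forward pass per candidate."""
    labeling = num_labeled * rollout_minutes
    scoring = num_candidates * probe_seconds / 60.0
    return labeling + scoring


# Hypothetical inputs, chosen only to illustrate the scaling behaviour.
CANDIDATES = 100_000
LABELED = 1_000
ROLLOUT_MINUTES = 20.0
PROBE_SECONDS = 1.0

in_loop = solver_in_the_loop_minutes(CANDIDATES, ROLLOUT_MINUTES)
amortized = solver_amortized_minutes(CANDIDATES, LABELED, ROLLOUT_MINUTES, PROBE_SECONDS)
print(f"solver-in-the-loop: {in_loop:,.0f} min")
print(f"solver-amortized:   {amortized:,.0f} min")
print(f"ratio:              {in_loop / amortized:.1f}x")
```

Under these assumptions the labeling pass dominates the amortized cost, so adding more candidates raises the total only by the per-candidate probe time.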
