FrontierSmith auto-generates open-ended coding problems, boosts Qwen3.5 by 8–12 points
Researchers released FrontierSmith, a system that evolves competitive programming tasks into open-ended variants, then filters them by solution diversity; training on the synthetic data lifted Qwen3.5 models by 8.8–12.1 points on FrontierCS and more than 300 Elo on ALE-bench.

Open-ended coding problems—tasks with no single correct answer—have long been a weak spot for LLMs, largely because training data for them is scarce and expensive to hand-curate.
FrontierSmith, detailed in a preprint released May 15, is an automated pipeline that transforms closed-ended competitive programming problems into open-ended variants. The system mutates problems by changing goals, restricting outputs, or generalizing inputs, then scores each candidate using a quantitative "idea divergence" metric that measures how many distinct solution strategies different solvers attempt. Only problems that elicit genuinely diverse approaches survive the filter. Agent-generated test cases and verifiers complete each selected problem.
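The preprint does not publish prompts, code, or the exact divergence formula, so the following is only a minimal sketch of the evolve-then-filter loop under stated assumptions: the LLM-backed steps (`propose`, the solvers, `label`) are hypothetical callables, the divergence metric is read as "fraction of distinct strategy labels across independent solver attempts," and the 0.5 threshold is an illustrative placeholder, not a value from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for LLM-backed steps (not from the paper):
ProposeFn = Callable[[str, str], str]   # (problem statement, mutation type) -> mutated statement
SolveFn = Callable[[str], str]          # problem statement -> candidate solution
LabelFn = Callable[[str], str]          # solution -> coarse strategy label (e.g. "greedy", "DP")

# The three mutation types described in the article.
MUTATIONS = ["change_goal", "restrict_output", "generalize_input"]

@dataclass
class Candidate:
    statement: str
    divergence: float

def idea_divergence(statement: str, solvers: List[SolveFn], label: LabelFn) -> float:
    """One plausible reading of the 'idea divergence' metric: have several solvers
    attempt the problem, map each solution to a strategy label, and return the
    fraction of distinct labels (1.0 = every attempt used a different idea)."""
    labels = [label(solver(statement)) for solver in solvers]
    return len(Counter(labels)) / max(len(labels), 1)

def evolve_and_filter(seed: str, propose: ProposeFn, solvers: List[SolveFn],
                      label: LabelFn, threshold: float = 0.5) -> List[Candidate]:
    """Mutate a closed-ended seed problem and keep only high-divergence variants."""
    kept = []
    for mutation in MUTATIONS:
        variant = propose(seed, mutation)
        score = idea_divergence(variant, solvers, label)
        if score >= threshold:  # variants that elicit few distinct ideas are discarded
            kept.append(Candidate(variant, score))
    return kept
```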
The authors trained Qwen3.5-9B and Qwen3.5-27B on the synthetic dataset and evaluated on FrontierCS and ALE-bench, two open-ended coding benchmarks. The 9B model gained 8.82 points on FrontierCS and 306.36 Elo on ALE-bench; the 27B model gained 12.12 points and 309.12 Elo, respectively. Notably, agents also took more turns and consumed more tokens per solution on the synthetic problems, mirroring their behavior on human-curated open-ended tasks. The authors argue that competitive programming problems, which are abundant and well-structured, offer a practical foundation for scaling long-horizon coding training data, sidestepping the scarcity bottleneck that has kept open-ended datasets small.