PlanningBench generates planning problems across 30+ task types to evaluate and train LLMs on coupled constraints
Researchers introduce PlanningBench, a framework that synthesizes diverse planning problems with automatic verification, revealing that frontier models still struggle with coupled constraints.

PlanningBench is a framework for generating scalable, verifiable planning data to evaluate and train large language models. The system abstracts real-world planning workflows into a structured taxonomy that covers more than 30 task types, along with their subtasks, constraint families, and difficulty factors. It then uses constraint-driven synthesis to produce self-contained problems, each paired with an instance-level verification checklist. This approach shifts planning data construction from collecting fixed benchmarks to controllable generation, which enables adaptive difficulty control and quality filtering while keeping tasks grounded in realistic scenarios.
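To make the idea of constraint-driven synthesis with an instance-level checklist concrete, here is a minimal Python sketch on a toy scheduling task. It is not the paper's implementation: the `Problem` class, the `synthesize` and `verify` functions, the scheduling domain, and the two difficulty knobs (`n_tasks`, `n_precedences`) are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    """A toy scheduling instance (hypothetical structure, for illustration only)."""
    durations: dict[str, int]            # task name -> duration in hours
    precedences: list[tuple[str, str]]   # (a, b): a must finish before b starts
    deadline: int                        # every task must finish by this hour


def synthesize(n_tasks: int, n_precedences: int, seed: int) -> Problem:
    """Generate one instance; n_tasks and n_precedences act as difficulty factors."""
    rng = random.Random(seed)
    names = [f"T{i}" for i in range(n_tasks)]
    durations = {name: rng.randint(1, 4) for name in names}
    # Only allow edges from a lower to a higher index, so the graph is acyclic.
    pairs = [(names[i], names[j])
             for i in range(n_tasks) for j in range(i + 1, n_tasks)]
    precedences = rng.sample(pairs, min(n_precedences, len(pairs)))
    # Running the tasks one after another always fits within the summed
    # durations, so this deadline guarantees the instance is solvable.
    deadline = sum(durations.values())
    return Problem(durations, precedences, deadline)


def verify(problem: Problem, schedule: dict[str, int]) -> dict[str, bool]:
    """Check a proposed schedule (task -> start hour) against each checklist item."""
    end = {t: schedule[t] + d
           for t, d in problem.durations.items() if t in schedule}
    checks = {
        "all tasks scheduled": set(schedule) == set(problem.durations),
        "no negative start": all(s >= 0 for s in schedule.values()),
        "deadline met": all(e <= problem.deadline for e in end.values()),
    }
    for a, b in problem.precedences:
        checks[f"{a} before {b}"] = (
            a in end and b in schedule and end[a] <= schedule[b]
        )
    return checks


# Usage: a serial schedule in index order should pass every checklist item.
problem = synthesize(n_tasks=4, n_precedences=3, seed=0)
serial, clock = {}, 0
for name, duration in problem.durations.items():
    serial[name] = clock
    clock += duration
assert all(verify(problem, serial).values())
```

In this sketch, raising `n_tasks` or `n_precedences` produces harder instances from the same template, and the checklist is derived from the instance itself rather than from a stored reference answer, so any candidate solution can be checked automatically.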
The framework targets planning scenarios that require models to coordinate goals, constraints, resources, and long-term consequences into executable solutions, a capability the paper identifies as fundamental to LLM reasoning. Evaluations using PlanningBench show that current frontier LLMs, both open-source and closed-source, still struggle to produce complete solutions when multiple constraints interact. Beyond benchmarking, reinforcement learning on verified PlanningBench data improved performance on unseen planning benchmarks and on broader instruction-following tasks. The authors note that problems with determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics than open-ended planning scenarios.
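One way to picture how verified planning data can supply a reward signal for reinforcement learning is to collapse a checklist's pass/fail results into a scalar. The sketch below is an assumption for illustration only; this summary does not describe the paper's actual reward design, and the function name `checklist_reward` and its `strict` flag are hypothetical.

```python
def checklist_reward(results: dict[str, bool], strict: bool = True) -> float:
    """Turn per-item checklist results into a scalar reward (hypothetical scheme).

    strict=True  -> 1.0 only if every item passes, else 0.0 (all-or-nothing).
    strict=False -> fraction of items that pass (partial credit).
    """
    if not results:
        return 0.0  # nothing was verified, so give no reward
    if strict:
        return 1.0 if all(results.values()) else 0.0
    return sum(results.values()) / len(results)


# Usage: a solution that satisfies one of two coupled constraints.
results = {"deadline met": True, "budget respected": False}
assert checklist_reward(results) == 0.0
assert checklist_reward(results, strict=False) == 0.5
```

The all-or-nothing variant reflects the observation that solutions are often incomplete when constraints interact: satisfying most items is not the same as producing a valid plan. Partial credit gives a denser signal but can reward plans that remain infeasible.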