SR²AM 30B matches 685B-scale systems while cutting reasoning tokens by up to 95%
A new preprint introduces SR²AM, an LLM architecture that decomposes agentic reasoning into simulative planning, self-regulation, and reactive execution. It reaches accuracy competitive with far larger models while cutting reasoning-token use by up to 95 percent.

Agentic reasoning systems spend tokens on undifferentiated chain-of-thought without first deciding whether planning is needed at all. Researchers aim to fix this structural inefficiency by teaching models when, and how deeply, to simulate future states.
SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), described in a preprint posted May 22, splits decision-making into three subsystems borrowed from cognitive science: simulative reasoning (System II) for deliberate planning via a world model, self-regulation (System III) for deciding when planning is worth the cost, and reactive execution (System I) for immediate action. The architecture realizes all three as distinct stages within a single LLM's chain-of-thought, with the LLM itself serving as the world model.
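The three-way split can be illustrated with a toy control loop. The sketch below is not the preprint's implementation: SR²AM realizes these stages inside an LLM's chain-of-thought, whereas this example uses hand-written Python functions on a one-dimensional track with hazard cells. All names (`react`, `should_plan`, `plan`, `world_model`, `run`) and the hazard-proximity rule are assumptions made for illustration only.

```python
from itertools import product

ACTIONS = (1, 2)  # toy action set: step one cell or jump two


def world_model(state, action):
    """Predict the next state. In this toy, the prediction is exact."""
    return state + action


def react(state):
    """System I: cheap default action with no lookahead."""
    return 1


def should_plan(state, hazards, lookahead=2):
    """System III: plan only when a hazard lies within `lookahead` cells."""
    return any(state + d in hazards for d in range(1, lookahead + 1))


def plan(state, hazards, horizon=3):
    """System II: simulate every action sequence of length `horizon` and
    return the first action of the safe sequence that travels furthest."""
    best_seq, best_end = None, -1
    for seq in product(ACTIONS, repeat=horizon):
        s, safe = state, True
        for a in seq:
            s = world_model(s, a)
            if s in hazards:
                safe = False
                break
        if safe and s > best_end:
            best_seq, best_end = seq, s
    # Fall back to the reactive action if no safe sequence exists.
    return best_seq[0] if best_seq else react(state)


def run(goal, hazards, horizon=3):
    """Advance from cell 0 until reaching or passing `goal`.
    Returns a list of (mode, state) pairs, one per step taken."""
    state, trace = 0, []
    while state < goal:
        if should_plan(state, hazards):
            action, mode = plan(state, hazards, horizon), "plan"
        else:
            action, mode = react(state), "react"
        # The toy environment shares its dynamics with the world model.
        state = world_model(state, action)
        trace.append((mode, state))
    return trace


if __name__ == "__main__":
    for mode, state in run(goal=6, hazards={3}):
        print(mode, state)
```

With a hazard at cell 3, the agent reacts while the path is clear, switches to simulation only when the hazard comes within range, and returns to reacting once it has jumped past. The design point this is meant to convey is that the cost of simulation is paid only on the steps where the regulator judges it worthwhile.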
The team tested two instantiations: v0.1 records decisions from a prompted multi-module system, while v1.0 reconstructs structured plans from traces of pretrained reasoning LLMs and trains the result via supervised learning followed by reinforcement learning. Across math, science, tabular analysis, and web information seeking, v0.1-8B achieves Pass@1 competitive with 120–355B parameter systems, and v1.0-30B matches models in the 685B–1T range while using 25.8–95.3% fewer reasoning tokens than comparable agentic LLMs.
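To put the token savings in perspective, a reduction of r percent means the comparison system uses 1 / (1 - r/100) times as many reasoning tokens. The short calculation below applies that conversion to the two endpoints of the reported range; the function name is illustrative, and no figures beyond those reported above are used.

```python
def baseline_multiple(reduction_pct):
    """Return how many times more tokens the baseline uses,
    given the percent reduction achieved relative to it."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


if __name__ == "__main__":
    for pct in (25.8, 95.3):
        multiple = baseline_multiple(pct)
        print(f"{pct}% fewer tokens: baseline uses {multiple:.2f}x as many")
```

At the low end of the range the comparison system uses roughly 1.35 times as many reasoning tokens; at the high end, roughly 21 times as many.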
Reinforcement learning increases the average planning horizon by 22.8% while planning frequency grows only 2.0%: the model learns to plan further ahead rather than more often. A companion preprint on simulative reasoning (SiRA) reports that the underlying architecture improves task completion rates by up to 124% over reactive baselines across constrained navigation, multi-hop information aggregation, and general instruction following. In constrained navigation, success rises from 0% for a representative open-web agent to 32.2%.
The work frames self-regulation as a learned meta-skill that could extend beyond planning to how agents govern their own learning and adaptation.