TMAS scales LLM reasoning with multi-agent memory and hybrid reward training
New preprint introduces TMAS, a test-time scaling framework that coordinates specialized agents using hierarchical memory banks and hybrid reward RL to improve reasoning on complex benchmarks.
TMAS, a test-time scaling framework from researchers George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, and Feng Chang, organizes inference as a collaborative process among specialized agents with hierarchical memory structures. The approach addresses a core limitation in existing structured test-time methods: they either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what to retain and reuse, limiting their ability to balance exploration and exploitation.
The framework introduces two memory layers. An experience bank stores reliable low-level intermediate conclusions and local feedback for reuse, while a guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. This explicit retention mechanism enables structured information flow across agents, trajectories, and refinement iterations.
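To make the two-tier design concrete, here is a minimal Python sketch of how such banks could be organized. Every name here (ExperienceBank, GuidelineBank, retain, retrieve, record, is_redundant) and the score-threshold retention rule are illustrative assumptions rather than the paper's implementation; the point is only that each bank makes an explicit keep-or-discard decision instead of accumulating everything.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Low-level memory: reliable intermediate conclusions plus local feedback."""
    entries: list = field(default_factory=list)

    def retain(self, conclusion: str, score: float, threshold: float = 0.8) -> None:
        # Explicit retention decision: keep only conclusions whose local
        # feedback clears a reliability threshold (threshold is an assumption).
        if score >= threshold:
            self.entries.append({"conclusion": conclusion, "score": score})

    def retrieve(self, k: int = 3) -> list:
        # Surface the k most reliable conclusions for reuse in later rollouts.
        best = sorted(self.entries, key=lambda e: e["score"], reverse=True)
        return [e["conclusion"] for e in best[:k]]


@dataclass
class GuidelineBank:
    """High-level memory: strategies already tried, used to avoid redundant rollouts."""
    strategies: set = field(default_factory=set)

    def record(self, strategy: str) -> None:
        self.strategies.add(strategy)

    def is_redundant(self, strategy: str) -> bool:
        # A new rollout consults the bank before committing to a strategy.
        return strategy in self.strategies
```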
Training uses a hybrid reward reinforcement learning scheme tailored to multi-agent collaboration. The scheme jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. On challenging reasoning benchmarks, TMAS scales more effectively and more stably across refinement iterations than existing test-time scaling baselines.
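The preprint's exact reward formulation is not given here, but a weighted sum over the three stated objectives is one natural reading of a hybrid scheme. The sketch below is a hedged illustration under that assumption; the term names and the weights are hypothetical.

```python
def hybrid_reward(
    correctness: float,   # base task reward, e.g. 1.0 if the final answer verifies
    reuse: float,         # how much retrieved experience the trajectory actually used
    novelty: float,       # 1.0 if the rollout's strategy is new to the guideline bank
    w_base: float = 1.0,
    w_reuse: float = 0.3,
    w_novel: float = 0.2,
) -> float:
    """Illustrative three-objective reward: preserve basic reasoning capability,
    reward experience utilization, and reward exploring untried strategies.
    The decomposition and weights are assumptions, not the paper's formula."""
    return w_base * correctness + w_reuse * reuse + w_novel * novelty
```

Under this reading, raising the novelty weight pushes rollouts toward unexplored strategies while raising the reuse weight exploits banked conclusions, which is one way such a scheme could balance exploration against exploitation.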
What stands out
- Hierarchical memory replaces passive accumulation. The experience and guideline banks explicitly decide what to retain and reuse, rather than passively accumulating noisy historical data.
- Multi-agent coordination with structured flow. Specialized agents collaborate through defined information pathways across trajectories and refinement rounds, replacing weakly coordinated parallel reasoning (see the sketch after this list).
- Hybrid reward training balances three objectives. The RL scheme preserves reasoning capability, improves experience reuse, and encourages novel exploration, addressing stability and effectiveness together.
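As a rough picture of that structured flow, one refinement round might look like the loop below. It reuses the hypothetical bank classes from the earlier sketch; the agent interface (a callable returning a conclusion, a feedback score, and a strategy label) and the loop itself are assumptions, not the paper's algorithm.

```python
def refinement_round(agents, question, exp_bank, guide_bank):
    """One hypothetical TMAS-style round: memory in, rollouts out, memory updated."""
    hints = exp_bank.retrieve()  # reuse reliable low-level conclusions
    conclusions = []
    for agent in agents:
        # Each specialized agent sees the shared hints and the strategies
        # already recorded, so it can steer away from redundant reasoning.
        conclusion, score, strategy = agent(question, hints, guide_bank.strategies)
        if guide_bank.is_redundant(strategy):
            continue  # skip rollouts that repeat an already-explored strategy
        conclusions.append(conclusion)
        exp_bank.retain(conclusion, score)  # explicit keep-or-discard decision
        guide_bank.record(strategy)         # later rounds avoid this strategy
    return conclusions
```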
