4B policy network beats GPT-5 by routing queries to frozen expert models
A lightweight reinforcement-learning orchestrator dynamically selects which frozen expert model and skill to invoke at each step, achieving 70.1% average accuracy across ten benchmarks and surpassing GPT-5 and Gemini-2.5-Pro without step-level supervision.

Researchers have shown that a 4-billion-parameter policy network can outperform frontier closed models by orchestrating ensembles of frozen experts rather than consolidating all knowledge into a single monolithic LLM.
Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), presented in a preprint posted May 22, treats heterogeneous multimodal tasks as sequential decision problems over a hierarchical registry of models and skills. At each step, the orchestrator decides whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized with outcome-based reinforcement learning: the reward signal comes from the final outcome alone, with no labels for intermediate steps.
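The decision loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Action`, `run_episode`, `outcome_reward`, `toy_policy`, and `toy_registry` are hypothetical, the rule-based policy stands in for the learned 4B network, and the binary exact-match reward is an assumed form of outcome-based reward.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical two-level registry: model name -> skill name -> frozen expert.
Registry = Dict[str, Dict[str, Callable[[str], str]]]


@dataclass
class Action:
    kind: str                     # "call" an expert or "answer" (terminate)
    model: Optional[str] = None
    skill: Optional[str] = None
    answer: Optional[str] = None


@dataclass
class Trajectory:
    steps: List[Tuple[Action, str]] = field(default_factory=list)
    final_answer: Optional[str] = None


def run_episode(query: str, policy, registry: Registry, max_steps: int = 4) -> Trajectory:
    """Roll out one episode: the policy picks an action at every step."""
    traj = Trajectory()
    context = [query]
    for _ in range(max_steps):
        action = policy(context, registry)
        if action.kind == "answer":
            # The policy chose to terminate with a final answer.
            traj.final_answer = action.answer
            break
        # Otherwise invoke the selected model-skill pair on the latest context.
        observation = registry[action.model][action.skill](context[-1])
        traj.steps.append((action, observation))
        context.append(observation)
    return traj


def outcome_reward(traj: Trajectory, gold: str) -> float:
    """Assumed reward: 1.0 for a correct final answer, 0.0 otherwise.
    No intermediate step is scored."""
    return 1.0 if traj.final_answer == gold else 0.0


# Toy stand-ins (hypothetical): one expert with one skill that sums integers.
toy_registry: Registry = {
    "calc": {"add": lambda text: str(sum(int(t) for t in text.split() if t.isdigit()))},
}


def toy_policy(context: List[str], registry: Registry) -> Action:
    # Rule-based placeholder for the learned policy: call the expert once,
    # then answer with whatever it returned.
    if len(context) == 1:
        return Action(kind="call", model="calc", skill="add")
    return Action(kind="answer", answer=context[-1])


traj = run_episode("2 3 5", toy_policy, toy_registry)
reward = outcome_reward(traj, gold="10")
```

In this sketch the policy receives the registry as an input on every step, so adding an entry to `toy_registry` changes the available actions without touching the rollout loop. Whether Maestro exposes its registry to the policy in this way is an assumption here, not something the article states.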
Across ten multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis, Maestro's 4B orchestrator achieves 70.1% average accuracy, edging out GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). The learned coordination policy also generalizes to unseen models and skills without retraining. When the registry is augmented with out-of-domain experts, average accuracy reaches 59.5% on four challenging benchmarks, outperforming all closed-source baselines tested.
The authors frame the result as evidence that dynamic composition of frozen specialist models can match or exceed single large models trained end-to-end, while maintaining computational efficiency and low latency. Source code is available at github.com/jinyangwu/Maestro.