4B policy network beats GPT-5 by routing queries to frozen expert models

A lightweight reinforcement-learning orchestrator dynamically selects which frozen expert model and skill to invoke at each step. Trained without step-level supervision, it reaches 70.1% average accuracy across ten benchmarks, surpassing GPT-5 and Gemini-2.5-Pro.

By Alex Sokoloff · May 18, 2026

Researchers have shown that a 4-billion-parameter policy network can outperform frontier closed models by orchestrating ensembles of frozen experts rather than consolidating all knowledge into a single monolithic LLM.

Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), presented in a preprint posted May 22, treats heterogeneous multimodal tasks as sequential decision problems over a hierarchical registry of models and skills. At each step, the orchestrator decides whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based reinforcement learning: the system learns from final outcomes alone, with no labels on intermediate steps.
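To make the decision loop concrete, here is a minimal Python sketch of an orchestrator stepping through a two-level registry and receiving a single outcome reward at the end. It is an illustration under stated assumptions, not the authors' implementation: the registry contents, the `scripted_policy` stand-in for the learned 4B policy, the exact-match reward, and every function name are hypothetical, and the preprint's actual training algorithm is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical hierarchical registry: model name -> skill name -> callable.
# Real experts would be frozen models; these toy lambdas only stand in for them.
Registry = Dict[str, Dict[str, Callable[[str], str]]]
Action = Optional[Tuple[str, str]]  # (model, skill), or None meaning "terminate"

REGISTRY: Registry = {
    "math_model": {"add": lambda q: str(sum(int(x) for x in q.split("+")))},
    "ocr_model": {"read": lambda q: q.strip()},
}


@dataclass
class Trajectory:
    actions: List[Tuple[str, str]] = field(default_factory=list)
    answer: str = ""
    reward: float = 0.0


def list_actions(registry: Registry) -> List[Action]:
    """Flatten the registry into model-skill pairs, plus a terminate action."""
    pairs: List[Action] = [(m, s) for m, skills in registry.items() for s in skills]
    return pairs + [None]


def scripted_policy(context: str, history: List[Tuple[str, str]],
                    actions: List[Action]) -> Action:
    """Placeholder for the learned policy: call one expert, then terminate."""
    return ("math_model", "add") if not history else None


def run_episode(query: str, registry: Registry, policy, gold: str,
                max_steps: int = 4) -> Trajectory:
    """Roll out one episode; the only reward is computed from the final answer."""
    traj = Trajectory()
    actions = list_actions(registry)
    context = query
    for _ in range(max_steps):
        action = policy(context, traj.actions, actions)
        if action is None:  # the policy decided to stop
            break
        model, skill = action
        context = registry[model][skill](context)  # invoke the frozen expert
        traj.actions.append(action)
    traj.answer = context
    # Outcome-based signal (assumed exact match): no intermediate step is labeled.
    traj.reward = 1.0 if traj.answer == gold else 0.0
    return traj


def per_step_returns(traj: Trajectory) -> List[float]:
    """Every step in the episode shares the same final-outcome reward."""
    return [traj.reward] * len(traj.actions)


if __name__ == "__main__":
    episode = run_episode("2+3", REGISTRY, scripted_policy, gold="5")
    print(episode.actions, episode.answer, per_step_returns(episode))
```

In this toy setup, adding an expert means adding an entry to `REGISTRY`, which widens the action list without touching the loop. Whether a trained policy uses such new entries well is the generalization question the preprint addresses.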

Across ten multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis, Maestro's 4B orchestrator achieves 70.1% average accuracy, edging GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining. When the registry is augmented with out-of-domain experts, performance reaches 59.5% average on four challenging benchmarks, outperforming all closed-source baselines tested.

The authors frame the result as evidence that dynamic composition of frozen specialist models can match or exceed single large models trained end-to-end, while maintaining computational efficiency and low latency. Source code is available at github.com/jinyangwu/Maestro.
