Qwen-AgentWorld 397B tops Claude Opus and GPT-5.4 on six-environment benchmark
Alibaba's Qwen released open-weight AgentWorld models that simulate real-world agent environments and outperform closed frontier models on agentic tasks.
Alibaba's Qwen team released Qwen-AgentWorld, an open-weight model family that simulates six real-world environments for AI agents (web browsing, terminal, coding, search, OS, and Android) and outperforms closed frontier models on agentic benchmarks. The 397B-parameter variant scored 58.71 on the AgentWorld benchmark, surpassing Claude Opus 4.8 and GPT-5.4, while the 35B MoE model beat Sonnet 4.6. Weights are available now on Hugging Face and ModelScope, with a full paper on arXiv and code on GitHub.
The models are designed to test agent capabilities in environments that mirror real developer and user workflows. The 397B model showed the strongest gains in coding, web navigation, and terminal tasks—the three domains where agentic systems most often fail in production. The 35B MoE variant delivers competitive performance at a fraction of the parameter count, making it viable for local deployment on consumer hardware.
What stands out
- 01 Open weights beat closed frontier models. The 397B variant scored 58.71 on the AgentWorld benchmark, ahead of Claude Opus 4.8 and GPT-5.4. The 35B MoE model outperformed Sonnet 4.6, marking one of the first times an open-weight model has topped a closed Anthropic release on an agentic eval.
- 02 Six simulated environments in one model. AgentWorld tests agents across web, terminal, coding, search, OS, and Android, a broader environment set than prior agent benchmarks, which typically focus on one or two domains. The model handles context switching between environments without task-specific fine-tuning.
- 03 Coding and terminal tasks saw the biggest gains. The 397B model's performance jump was most pronounced in coding, where it handles multi-file edits and dependency resolution, and in terminal tasks, where it executes shell commands and parses output. Web navigation improved but remained the hardest domain.