Qwen-AgentWorld 397B tops Claude Opus and GPT-5.4 on six-environment benchmark | UncensoredHub

Alibaba's Qwen released open-weight AgentWorld models that simulate real-world agent environments and outperform closed frontier models on agentic tasks.

By Alex Sokoloff · June 25, 2026

Alibaba's Qwen team released Qwen-AgentWorld, an open-weight model family that simulates six real-world environments for AI agents (web browsing, terminal, coding, search, OS, and Android) and outperforms closed frontier models on agentic benchmarks. The 397B-parameter variant scored 58.71 on the AgentWorld benchmark, surpassing Claude Opus 4.8 and GPT-5.4, while the 35B MoE model beat Sonnet 4.6. Weights are available now on HuggingFace and ModelScope, with a full paper on arXiv and code on GitHub.

The models are designed to test agent capabilities in environments that mirror real developer and user workflows. The 397B model showed the strongest gains in coding, web navigation, and terminal tasks, the three domains where agentic systems most often fail in production. The 35B MoE variant delivers competitive performance at a fraction of the parameter count, making it viable for local deployment on consumer hardware.

What stands out

01 Open weights beat closed frontier models. The 397B variant scored 58.71 on the AgentWorld benchmark, ahead of Claude Opus 4.8 and GPT-5.4. The 35B MoE model outperformed Sonnet 4.6, marking one of the first times an open-weight model has topped a closed Anthropic release on an agentic eval.
02 Six simulated environments in one model. AgentWorld tests agents across web, terminal, coding, search, OS, and Android, a broader environment set than prior agent benchmarks, which typically focus on one or two domains. The model handles context switching between environments without task-specific fine-tuning.
03 Coding and terminal tasks saw the biggest gains. The 397B model's performance jump was most pronounced in coding, where it handles multi-file edits and dependency resolution, and terminal tasks, where it executes shell commands and parses output. Web navigation improved but remained the hardest domain.