WildClawBench: Claude Opus 4.7 scores 62% on native-runtime agent tasks
A new benchmark runs frontier models through native CLI harnesses and Docker containers for 8-minute, 20-tool-call tasks—Claude Opus 4.7 tops out at 62.2%, every other model below 60%.

Researchers argue that real-world, long-horizon agent work demands native-runtime evaluation, not synthetic sandboxes and mock APIs. WildClawBench, a new benchmark released this week, puts that principle into practice with 60 human-authored tasks that run inside reproducible Docker containers hosting actual CLI agent harnesses—OpenClaw, Claude Code, Codex, or Hermes Agent—with access to real tools rather than mock services. Each task averages roughly eight minutes of wall-clock time and more than 20 tool calls, spanning six thematic categories and presented in both English and Chinese with multimodal inputs.
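To make the setup concrete, here is a minimal sketch of how one such task might be launched in a fresh container. The manifest fields, image name, and harness command-line flags are all assumptions for illustration; WildClawBench's actual schema and entry points live in the released code.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical WildClawBench-style task manifest (field names are illustrative)."""
    task_id: str
    harness: str               # e.g. "openclaw", "claude-code", "codex", "hermes-agent"
    language: str              # "en" or "zh"
    category: str              # one of the six thematic categories
    prompt: str
    time_budget_s: int = 480   # roughly eight minutes of wall-clock time
    image: str = "wildclawbench/task-base:latest"  # placeholder image name

def run_task(task: Task) -> subprocess.CompletedProcess:
    """Launch one task in a throwaway container so side effects stay reproducible."""
    cmd = [
        "docker", "run", "--rm",
        "--name", f"wcb-{task.task_id}",
        task.image,
        task.harness, "--prompt", task.prompt,  # placeholder harness CLI flags
    ]
    # Raises subprocess.TimeoutExpired if the agent exceeds the time budget.
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=task.time_budget_s)
```

Running each task in a disposable container is what makes side-effect grading tractable: the grader can compare the container's final state against a known baseline.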
The grading system combines deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, Claude Opus 4.7 reaches 62.2% overall under the OpenClaw harness—the highest score recorded—while every other model stays below 60%. Switching harness alone shifts a single model's performance by up to 18 percentage points, underscoring how tightly agent capability is coupled to runtime design.
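The released grading code is the authority here, but the three-layer design can be sketched as a short-circuiting pipeline in which a task passes only if every layer agrees. All of the functions and parameters below are illustrative stand-ins, not WildClawBench's actual API.

```python
from typing import Callable

def grade(transcript: str,
          rule_checks: list[Callable[[str], bool]],
          expected_state: dict[str, str],
          observed_state: dict[str, str],
          judge: Callable[[str], bool]) -> bool:
    """Illustrative three-layer grader: rules -> state audit -> LLM/VLM judge."""
    # 1. Deterministic rule-based checks on the agent transcript
    #    (e.g. required outputs present); cheap, so they run first.
    if not all(check(transcript) for check in rule_checks):
        return False
    # 2. Environment-state audit: did the agent's side effects
    #    (files written, services configured) match what the task requires?
    if any(observed_state.get(path) != digest
           for path, digest in expected_state.items()):
        return False
    # 3. Semantic verification by an LLM/VLM judge for criteria
    #    that deterministic rules cannot capture.
    return judge(transcript)
```

Ordering the layers from cheapest to most expensive keeps the judge, the slowest and least deterministic component, off the critical path for clear failures.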
Authors Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, and Yang JingYi have released the tasks, code, and containerized tooling to support reproducible evaluation. Because the benchmark runs in the same environments where production agents are deployed, its ceiling is hard to dismiss: with the best model clearing only 62.2%, long-horizon, native-runtime agent work remains a far-from-solved challenge for current frontier models.