MM-ToolBench exposes 62-point gap between Claude Opus 4.6 and human performance on multimodal tool tasks
New benchmark from six researchers evaluates multimodal tool-using agents on 100 executable customer-service and creative workflows, revealing how current models struggle with closed-loop artifact inspection and self-correction.

MM-ToolBench is a task-oriented benchmark that tests whether agents can interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and self-correct before delivering a final result. Its 100 executable tasks span customer service and intelligent creation workflows, are divided into 20 subcategory slices, and are supported by 27 MCP servers hosting 324 tools. Unlike prior benchmarks that isolate tool use or computer use from multimodal reasoning, MM-ToolBench requires closed-loop multimodal verification: agents must execute a tool, render or transform the output, check whether it meets task-specific requirements, and revise if it fails.
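The execute, inspect, and revise cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not MM-ToolBench's harness: `closed_loop`, `render_banner`, `check_banner`, and `shrink_banner` are hypothetical names, and the 800-pixel width limit is an invented requirement standing in for a task-specific check.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Artifact = Dict[str, Any]


@dataclass
class LoopResult:
    artifact: Artifact
    passed: bool
    attempts: int


def closed_loop(
    execute: Callable[[Request], Artifact],
    inspect: Callable[[Artifact], List[str]],
    revise: Callable[[Request, List[str]], Request],
    request: Request,
    max_attempts: int = 3,
) -> LoopResult:
    """Run a tool, inspect its artifact, and revise the request until it passes."""
    artifact: Artifact = {}
    for attempt in range(1, max_attempts + 1):
        artifact = execute(request)            # 1. execute the tool
        problems = inspect(artifact)           # 2. check the intermediate artifact
        if not problems:
            return LoopResult(artifact, True, attempt)
        request = revise(request, problems)    # 3. revise and try again
    return LoopResult(artifact, False, max_attempts)


# --- Toy stand-ins for a real tool and a task-specific check (all hypothetical) ---
MAX_WIDTH = 800  # invented requirement, for illustration only


def render_banner(request: Request) -> Artifact:
    # Pretend tool: "renders" a banner and reports its properties.
    return {"text": request["text"], "width": request["width"]}


def check_banner(artifact: Artifact) -> List[str]:
    # Returns a list of problems; an empty list means the artifact passes.
    return ["too_wide"] if artifact["width"] > MAX_WIDTH else []


def shrink_banner(request: Request, problems: List[str]) -> Request:
    # Revise the request based on what the inspection found.
    revised = dict(request)
    if "too_wide" in problems:
        revised["width"] = MAX_WIDTH
    return revised


result = closed_loop(render_banner, check_banner, shrink_banner,
                     {"text": "Spring sale", "width": 1600})
print(result)
```

In this toy run the first render fails the width check, the request is revised, and the second render passes. An agent that skips the inspection step would have delivered the first, non-conforming artifact, which is the failure mode the benchmark is designed to surface.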
The benchmark addresses a gap that has become more visible as frontier models add function-calling and computer-use capabilities yet struggle to chain them reliably across realistic professional workflows. Existing evaluations often measure whether an agent can call a single API or click through a scripted browser session, but they rarely test the full loop of tool execution, artifact inspection, and self-correction that defines real-world productivity tasks. MM-ToolBench's two macro task families, customer service and intelligent creation, were chosen to reflect domains where multimodal reasoning and tool coordination are both essential and where intermediate outputs (rendered images, transformed documents, API responses) must be verified before the agent proceeds.
The evaluation harness couples MCP-based execution with task-specific grounded evaluators and a semi-automated pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that the benchmark remains highly challenging. Claude Opus 4.6, widely considered one of the strongest coding-agent models, achieved 32.0 percent task success, while human annotators reached 94.0 percent. The 62-percentage-point gap suggests that current models struggle with the inspect-and-revise loop even when they can execute individual tool calls correctly.
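The headline gap is a difference in percentage points, which the short sketch below makes explicit. It assumes each of the 100 tasks is scored once as a binary pass or fail, so that 32.0 percent corresponds to 32 passed tasks and 94.0 percent to 94; the scoring protocol is not described here, and the per-task lists are reconstructed from the reported rates purely for illustration.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks marked as passed."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative per-task outcomes rebuilt from the reported rates
# (assumption: one binary pass/fail result per task, 100 tasks in total).
model_outcomes = [True] * 32 + [False] * 68
human_outcomes = [True] * 94 + [False] * 6

model_rate = success_rate(model_outcomes)
human_rate = success_rate(human_outcomes)
gap_points = human_rate - model_rate  # difference in percentage points

print(f"model: {model_rate:.1f}%  human: {human_rate:.1f}%  gap: {gap_points:.1f} points")
```

Stated as a ratio rather than a difference, the same figures put the human success rate at roughly three times the model's.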