Claw-SWE-Bench multilingual benchmark exposes 54-point adapter gap in coding agents
A new 350-instance benchmark across eight languages shows OpenClaw's minimal adapter scoring 19.1% Pass@1 while the full adapter hits 73.4% on the same GLM 5.1 backbone, suggesting that harness design matters as much as model choice.

Claw-SWE-Bench is a multilingual coding benchmark that measures how well general-purpose agent harnesses, which the authors call "claws," solve real GitHub issues. The benchmark contains 350 issue-resolution tasks spanning eight programming languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after cleaning out future commits. An 80-instance Lite subset, chosen with a cost-aware selection procedure, offers faster validation.
The benchmark standardizes the evaluation contract (fixed prompts, runtime budgets, Docker workspace setup, patch extraction, and scoring) so that heterogeneous agent frameworks can be compared fairly. That standardization matters: OpenClaw paired with a minimal direct-diff adapter scores only 19.1% Pass@1 on the full benchmark, but swapping in the full adapter lifts the same GLM 5.1 backbone to 73.4%. The 54.3-percentage-point gap shows adapter design is as critical as the underlying model.
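For readers who want the arithmetic spelled out, here is a minimal Python sketch of how a Pass@1 score and the adapter gap could be computed. The function names and the four-instance toy run are illustrative assumptions, not part of the benchmark's published tooling; only the 19.1% and 73.4% figures come from the reported results.

```python
def pass_at_1(resolved_flags):
    """Percentage of instances resolved on the first (single) attempt.

    `resolved_flags` is a hypothetical list of booleans, one per instance.
    """
    flags = list(resolved_flags)
    if not flags:
        raise ValueError("need at least one instance")
    return 100.0 * sum(1 for flag in flags if flag) / len(flags)


def gap_in_points(low_score, high_score):
    """Difference between two percentages, in percentage points."""
    return round(high_score - low_score, 1)


# Toy run (made-up data): 3 of 4 instances resolved.
toy_score = pass_at_1([True, False, True, True])

# Reported Pass@1 figures: minimal adapter vs. full adapter, same backbone.
adapter_gap = gap_in_points(19.1, 73.4)

print(f"toy Pass@1: {toy_score:.1f}%")
print(f"adapter gap: {adapter_gap} percentage points")
```

The gap is expressed in percentage points (an absolute difference), not as a relative improvement.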
Harness and cost as evaluation axes
Across an OpenClaw sweep over nine models and a five-harness sweep over two models, changing the model shifts Pass@1 by 29.4 percentage points, while changing the harness with the model held constant moves it by 27.4 points. Systems with similar accuracy can differ substantially in total API cost, which the benchmark tracks as a first-class metric alongside pass rate. The authors argue that treating harness choice and cost accounting as evaluation dimensions, rather than looking at model accuracy alone, gives a more complete picture of production-ready coding agents.
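To illustrate why cost belongs next to accuracy, the sketch below compares two systems with the same pass rate but different API spend. Everything here is hypothetical: the `SystemRun` class, the system names, and all numbers are invented for illustration and do not come from the benchmark, and cost per resolved instance is just one plausible way to summarize cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRun:
    """One (model, harness) configuration evaluated on a task set."""
    name: str
    resolved: int        # instances solved on the first attempt
    total: int           # instances attempted
    api_cost_usd: float  # total API spend for the whole run

    @property
    def pass_at_1(self) -> float:
        return 100.0 * self.resolved / self.total

    @property
    def cost_per_resolved(self) -> float:
        # A run that resolves nothing has unbounded cost per success.
        if self.resolved == 0:
            return float("inf")
        return self.api_cost_usd / self.resolved


# Made-up runs: identical accuracy, threefold difference in spend.
runs = [
    SystemRun("system_b", resolved=40, total=80, api_cost_usd=60.0),
    SystemRun("system_a", resolved=40, total=80, api_cost_usd=20.0),
]

# Rank by accuracy first, then break ties with the cheaper run.
ranked = sorted(runs, key=lambda r: (-r.pass_at_1, r.cost_per_resolved))

for run in ranked:
    print(f"{run.name}: {run.pass_at_1:.1f}% Pass@1, "
          f"${run.cost_per_resolved:.2f} per resolved instance")
```

A leaderboard sorted on pass rate alone would treat these two runs as equivalent; adding the cost column separates them.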
The dataset and adapter protocol are available on GitHub and HuggingFace. The preprint appears on arXiv.






