
Claw-SWE-Bench multilingual benchmark exposes 54-point adapter gap in coding agents

A new 350-instance benchmark across eight languages shows OpenClaw's minimal adapter scores 19.1% Pass@1 while the full adapter hits 73.4% on the same GLM 5.1 backbone, revealing harness design matters as much as model choice.

By Alex Sokoloff · June 11, 2026

Claw-SWE-Bench is a multilingual coding benchmark that measures how well general-purpose agent harnesses—what the authors call "claws"—solve real GitHub issues. The benchmark contains 350 issue-resolution tasks spanning eight programming languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after cleaning out future commits. An 80-instance Lite subset offers faster validation using a cost-aware selection procedure.

The benchmark standardizes the evaluation contract—fixed prompts, runtime budgets, Docker workspace setup, patch extraction, and scoring—so that heterogeneous agent frameworks can be compared fairly. That standardization matters: OpenClaw paired with a minimal direct-diff adapter scores only 19.1% Pass@1 on the full benchmark, but swapping in the full adapter lifts the same GLM 5.1 backbone to 73.4%. The 54-percentage-point gap shows adapter design is as critical as the underlying model.
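As a rough illustration of the metric behind these figures, the sketch below computes Pass@1 as the share of instances whose single attempt was resolved, then takes the difference between the two reported scores. The `InstanceResult` type, its field names, and the toy data are hypothetical and are not taken from the benchmark's code; only the 19.1% and 73.4% values come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    instance_id: str  # hypothetical identifier, not a real benchmark ID
    resolved: bool    # True if the single attempt's patch passed scoring


def pass_at_1(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved on the first (only) attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def gap_in_points(low_pct: float, high_pct: float) -> float:
    """Difference between two pass rates, in percentage points."""
    return round(high_pct - low_pct, 1)


# Toy data (made up) to show the metric itself: 3 of 4 resolved.
toy = [
    InstanceResult("a", True),
    InstanceResult("b", False),
    InstanceResult("c", True),
    InstanceResult("d", True),
]
toy_score = pass_at_1(toy)

# Scores reported in the article: minimal adapter vs. full adapter.
reported_gap = gap_in_points(19.1, 73.4)
print(f"toy Pass@1: {toy_score:.1f}%  reported gap: {reported_gap} points")
```

The reported gap works out to 54.3 points, which the headline rounds to 54.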

Harness and cost as evaluation axes

Across an OpenClaw sweep over nine models and a five-harness sweep over two models, changing the model shifts Pass@1 by 29.4 percentage points, while changing the harness with the model held constant moves it by 27.4 points. Systems with similar accuracy can differ substantially in total API cost, which the benchmark tracks as a first-class metric alongside pass rate. The authors argue that treating harness choice and cost accounting as evaluation dimensions, rather than reporting model accuracy alone, gives a more complete picture of production-ready coding agents.
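One way to read pass rate and cost together is to keep only the systems that no other system beats on both axes. The sketch below is an assumption about how such a comparison might be done, not the benchmark's own procedure. The `RunSummary` type, the system names, and every number in `runs` are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    name: str         # hypothetical system label
    pass_at_1: float  # percent of instances resolved
    cost_usd: float   # total API cost for the run


def cost_per_resolved(run: RunSummary, n_instances: int) -> float:
    """Total cost divided by the number of resolved instances."""
    resolved = run.pass_at_1 / 100.0 * n_instances
    return float("inf") if resolved == 0 else run.cost_usd / resolved


def pareto_front(runs: list[RunSummary]) -> list[RunSummary]:
    """Runs not dominated on (higher pass rate, lower cost), cheapest first."""
    front = []
    for r in runs:
        dominated = any(
            o.pass_at_1 >= r.pass_at_1
            and o.cost_usd <= r.cost_usd
            and (o.pass_at_1 > r.pass_at_1 or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost_usd)


# Placeholder values only; none of these are results from the benchmark.
runs = [
    RunSummary("system_a", 70.0, 100.0),
    RunSummary("system_b", 70.0, 300.0),  # same accuracy as system_a, pricier
    RunSummary("system_c", 75.0, 400.0),
    RunSummary("system_d", 40.0, 50.0),
]
front_names = [r.name for r in pareto_front(runs)]
print(front_names)
```

In this toy data, `system_b` drops out because `system_a` matches its accuracy at a lower cost, which is the kind of difference a pass-rate-only leaderboard would hide.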

The dataset and adapter protocol are available on GitHub and HuggingFace. The preprint appears on arXiv.
