Claude Opus 4.8 hits 69.2% on agent coding with self-assessment fixes

Anthropic's Claude Opus 4.8 ships with improved self-assessment, longer autonomous context retention, and a 4.9-point gain over 4.7 on the SWE-Bench Pro agentic coding benchmark, where it now leads GPT-5.5.

By Alex Sokoloff · May 30, 2026

Anthropic released Claude Opus 4.8 this week with three concrete changes: sharper situational judgment, more honest self-reporting of limitations, and longer context retention during autonomous work without human prompts. The update targets a persistent pain point in agent workflows: models confidently declaring success when they have actually stalled mid-task.

On SWE-Bench Pro, the agentic coding benchmark, Opus 4.8 scores 69.2 percent, compared with 64.3 percent for 4.7 and 58.6 percent for GPT-5.5. Computer use on OSWorld reaches 83.4 percent. Knowledge work on GDPval-AA climbs to 1890 from 1753 in the prior version. GPT-5.5 still leads on terminal coding at 78.2 percent versus 74.6 percent for Opus 4.8, a 3.6-point gap. Pricing holds steady.

The self-assessment improvement is the headline feature for practitioners who run long-lived agents. If the model now reliably flags when it's stuck rather than hallucinating completion, that cuts manual checkpoint overhead. The real test will be whether multi-step refactoring tasks and debugging loops actually reflect the claimed honesty gains, or whether edge cases still produce false reports of completion at the same rate as 4.7. Watch whether the terminal coding gap closes in the next point release: Anthropic trails GPT-5.5 by 3.6 percentage points there, and the agentic coding lead suggests the architecture has room to tighten shell-command accuracy without sacrificing the new self-awareness layer.
