SWE Atlas evaluates coding agents on Q&A, testing, and refactoring—not just bug fixes
Researchers introduce SWE Atlas, a 284-task benchmark suite evaluating coding agents on codebase Q&A, test writing, and refactoring—workflows underrepresented in prior benchmarks.
SWE Atlas, a new benchmark suite introduced in an arXiv preprint this week, measures coding agent performance across three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). The benchmark shifts focus from issue resolution—the dominant task in existing SWE benchmarks—to workflows that better reflect day-to-day engineering work. Tasks are intentionally under-specified to mirror real-world usage, where agents must explore codebases and reason about runtime behavior without explicit step-by-step instructions.
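The announcement doesn't describe the preprint's exact task format, but an under-specified task in this style might be represented as a record like the following. This is a hypothetical sketch; the field names, workflow labels, and example values are illustrative assumptions, not taken from SWE Atlas.

```python
from dataclasses import dataclass, field

# Hypothetical task record for a SWE Atlas-style benchmark.
# All field names and values are illustrative assumptions,
# not the preprint's actual schema.
@dataclass
class AtlasTask:
    task_id: str        # unique identifier, e.g. "refactor-0007"
    workflow: str       # one of: "codebase_qa", "test_writing", "refactoring"
    repo: str           # codebase the agent must explore on its own
    prompt: str         # intentionally under-specified: a goal, not steps
    rubric: list[str] = field(default_factory=list)  # quality criteria

# An under-specified prompt states the outcome and leaves the
# exploration, planning, and runtime reasoning to the agent.
example = AtlasTask(
    task_id="refactor-0007",
    workflow="refactoring",
    repo="acme/payments",
    prompt="Retry logic in the billing client is duplicated; consolidate it.",
    rubric=["behavior preserved", "reusable abstraction", "no dead code left"],
)
```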
Evaluation combines programmatic checks with rubric-based assessment, measuring not just functional correctness but also software engineering quality: test completeness, refactor maintainability, reusable abstractions, and codebase hygiene. Runs across frontier and open-weight models revealed a sharp divide: GPT-5.4 and Opus 4.7 achieved the strongest overall scores, while open-weight models lagged significantly. Top performers relied on extensive codebase exploration and runtime-driven reasoning, but even the best models struggled with subtle edge cases, complex runtime analysis, and adherence to best practices.
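The announcement doesn't detail how programmatic checks and rubric scores are combined, but a minimal sketch of one plausible hybrid scheme might look like this. The gating rule, 50/50 weighting, and rubric items below are assumptions for illustration, not the authors' method:

```python
# Minimal sketch of hybrid scoring: programmatic checks gate functional
# correctness, then a rubric grades software engineering quality.
# The weighting and rubric items are assumptions, not the paper's scheme.
def score_submission(checks_passed: bool, rubric_scores: dict[str, float]) -> float:
    if not checks_passed:              # correctness as a hard gate
        return 0.0
    # Each rubric item is scored 0.0-1.0 by a human or model grader.
    quality = sum(rubric_scores.values()) / len(rubric_scores)
    return 0.5 + 0.5 * quality         # half for passing, half for quality

print(score_submission(True, {
    "test_completeness": 0.8,
    "refactor_maintainability": 0.6,
    "codebase_hygiene": 1.0,
}))  # 0.9
```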
The under-specified task design and rubric-based scoring represent a step toward evaluating agents as they're actually used in production. While frontier models show promise on Q&A and refactoring, test-writing quality remains inconsistent—agents often miss edge cases or produce brittle assertions. The benchmark and evaluation code are expected to be released alongside the preprint. Watch for follow-up releases that clarify whether the dataset includes real-world codebases or synthetic repositories, and whether rubric-based scoring can be automated at scale without human review.
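To make "brittle assertions" concrete, the contrast below is a hypothetical illustration (the function and tests are not drawn from the benchmark) of a test that pins incidental details versus one that covers the edge cases agents reportedly miss.

```python
import pytest

# Toy function under test; purely illustrative, not from SWE Atlas.
def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Brittle: pins the exact error message, so any rewording of the
# message breaks the test even though behavior is unchanged.
def test_chunk_brittle():
    with pytest.raises(ValueError) as excinfo:
        chunk([1], 0)
    assert str(excinfo.value) == "size must be positive"

# Sturdier: asserts the behavioral contract, including the edge
# cases (empty input, oversized chunk) that are easy to miss.
def test_chunk_edge_cases():
    assert chunk([], 3) == []
    assert chunk([1, 2, 3], 5) == [[1, 2, 3]]
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]
```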
