AgentKernelArena benchmark reveals GPU kernel agents generalize poorly on unseen input shapes
New benchmark tests AI coding agents on 196 GPU kernel optimization tasks, revealing near-perfect compilation but significant generalization gaps when agents encounter input configurations they did not see while optimizing.

AI coding agents can compile and optimize GPU kernels at near-perfect correctness rates, but they struggle to generalize those optimizations beyond the input configurations they've seen during development.
AgentKernelArena, a new open-source benchmark from researchers including Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, and Ji Liu, evaluates complete AI agent workflows on GPU kernel optimization. The benchmark spans 196 tasks in three categories: HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. Each task runs in an isolated workspace where agents iteratively read code, invoke compilers and profilers, and refine implementations until they pass gated compilation, correctness, and performance checks.
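To make the idea of gated checks concrete, here is a minimal Python sketch of a three-stage gate in which a failure at one stage skips the stages after it. It is an illustration only, not the benchmark's actual harness: the names `GateResult` and `evaluate_gated`, the callback signatures, and the definition of speedup as baseline time divided by candidate time are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    """Outcome of one task under a hypothetical three-stage gate."""
    compiled: bool
    correct: bool
    speedup: Optional[float]  # None unless both earlier gates passed


def evaluate_gated(
    compile_fn: Callable[[], bool],
    correctness_fn: Callable[[], bool],
    baseline_ms: float,
    candidate_ms_fn: Callable[[], float],
) -> GateResult:
    """Run the checks in order; a failed gate skips all later ones."""
    if not compile_fn():
        return GateResult(compiled=False, correct=False, speedup=None)
    if not correctness_fn():
        return GateResult(compiled=True, correct=False, speedup=None)
    # Assumed definition: speedup = baseline time / candidate time.
    return GateResult(True, True, baseline_ms / candidate_ms_fn())
```

Under this arrangement a kernel that compiles but returns wrong results is never timed, so a fast but incorrect implementation cannot contribute a speedup.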
Testing production agents including Cursor Agent, Claude Code, and Codex Agent, the researchers found the strongest configurations achieved mean speedups of 6.89× on PyTorch-to-HIP tasks, 6.69× on HIP-to-HIP, and 2.13× on Triton-to-Triton. Compilation and correctness rates were high across most categories.
The critical finding emerged in the unseen-configuration protocol. HIP-to-HIP and Triton-to-Triton optimizations largely transferred to input shapes agents had never encountered. PyTorch-to-HIP tasks, however, showed substantial correctness drops on unseen configurations. Agents generating kernels from scratch frequently hardcoded shape-specific assumptions rather than writing general implementations, a sign that current agents optimize for the immediate task rather than for robust, reusable code.
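The failure mode is easy to reproduce in miniature. The Python sketch below is a toy stand-in, not code from the benchmark: a row-sum routine that silently assumes four columns passes on the shape it was developed against and fails on shapes it has not seen, while a general version passes on both. All function names and the specific shapes are hypothetical.

```python
def reference_row_sum(rows):
    """Trusted reference: sum each row, whatever its length."""
    return [sum(row) for row in rows]


def hardcoded_row_sum(rows):
    """Bakes in the seen shape: assumes every row has exactly 4 columns."""
    return [row[0] + row[1] + row[2] + row[3] for row in rows]


def general_row_sum(rows):
    """Loops over the actual row length instead of assuming it."""
    out = []
    for row in rows:
        acc = 0
        for value in row:
            acc += value
        out.append(acc)
    return out


def make_input(n_rows, n_cols):
    """Deterministic test matrix for a given (rows, cols) shape."""
    return [[i * n_cols + j for j in range(n_cols)] for i in range(n_rows)]


def passes(candidate, shapes):
    """True if the candidate matches the reference on every shape."""
    return all(
        candidate(make_input(r, c)) == reference_row_sum(make_input(r, c))
        for r, c in shapes
    )


SEEN_SHAPES = [(2, 4)]            # configuration used during development
UNSEEN_SHAPES = [(3, 5), (1, 8)]  # held-out configurations

print(passes(hardcoded_row_sum, SEEN_SHAPES))    # True
print(passes(hardcoded_row_sum, UNSEEN_SHAPES))  # False
print(passes(general_row_sum, UNSEEN_SHAPES))    # True
```

A check that only covers the seen shape cannot tell the two implementations apart, which is why a held-out configuration set is needed to expose the difference.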
The framework is designed to be modular and extensible, enabling future evaluations across different agents, task types, and hardware targets. The preprint was published on arXiv on May 19, 2026.