CUDABeaver reveals 40-point score swings in LLM GPU debugging benchmarks
New arXiv preprint introduces 213 real-world CUDA debugging tasks to measure whether LLMs fix broken GPU code or just slow it down to pass tests.
CUDABeaver is a new benchmark for LLM-based CUDA debugging that separates genuine repair from what its authors call "repair by degeneration": passing correctness tests by simplifying broken GPU code into slower, safer programs that abandon the original optimization structure. The benchmark comprises 213 tasks drawn from real failing workspaces produced during LLM-based CUDA generation, each providing the broken candidate, native build and test commands, raw error evidence, and a single editable file. Current evaluations of LLM CUDA programming miss this distinction entirely, allowing models to appear competent while actually degrading performance.
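The distinction is easiest to see in code. The sketch below is a hypothetical illustration, not a task from the benchmark: the "broken candidate" is a tiled matrix multiply whose genuine fix is a single missing __syncthreads(), while the degenerate "repair" replaces it with a naive kernel that satisfies any correctness-only test but abandons the shared-memory tiling.

```cuda
// Hypothetical sketch (not from the benchmark) of "repair by
// degeneration" at the kernel level. Compile with: nvcc -o degen degen.cu
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 256;   // matrix dimension (illustrative)
constexpr int TILE = 16;

// Broken candidate: tiled matrix multiply with a classic race.
// Without a barrier at the end of the tile loop, fast threads
// overwrite shared tiles that slower threads are still reading.
__global__ void matmul_tiled_broken(const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        // BUG: a second __syncthreads() belongs here; the genuine
        // repair is that single line.
    }
    C[row * N + col] = acc;
}

// "Repair by degeneration": the fixer deletes the tiling entirely and
// emits a naive global-memory kernel. Every output is now correct, so
// the tests pass -- but the optimization the task was built around is
// gone and throughput collapses.
__global__ void matmul_degenerate(const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main() {
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;  // unified memory for brevity
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    matmul_degenerate<<<grid, block>>>(A, B, C);
    cudaDeviceSynchronize();

    // A correctness-only check like this is exactly what the
    // degenerate kernel satisfies: each entry of ones x ones equals N.
    printf("C[0] = %.1f (expected %d)\n", C[0], N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

A correctness-only harness scores the one-line __syncthreads() fix and the naive rewrite identically; only a performance-preservation check, of the kind CUDABeaver adds, tells them apart.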
The paper introduces pass@k(M,C,A), a protocol-conditional debugging metric that makes the fixer model M, task corpus C, and protocol axes A explicit, then applies it across seven frontier LLMs. Results show that fixers appear much stronger when the tolerance for performance loss is high, but even a slightly stricter performance requirement can sharply reduce measured success: scores shift by up to 40 percentage points depending on how much slowdown is allowed. The benchmark reports results by failure category, debugging trajectory, stagnation mode, and performance preservation, covering scientific computing, machine learning, graphics, and systems workloads where GPU usage has rapidly expanded.
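The preprint's exact definition is not reproduced here; a plausible reading, assuming it builds on the standard unbiased pass@k estimator (Chen et al., 2021) and folds performance preservation into the success criterion, is:

\[
\operatorname{pass@}k(M, C, A) \;=\; \mathbb{E}_{w \in C}\!\left[\, 1 - \frac{\binom{n - c_w}{k}}{\binom{n}{k}} \,\right]
\]

Here n repair attempts per task w are sampled from fixer M under protocol A, and c_w counts the attempts that both pass the task's native tests and keep runtime within an allowed slowdown factor, e.g. requiring the fixed program's runtime to stay below (1 + delta) times the reference runtime. Tightening or loosening that slowdown budget along the protocol axes is what produces the up-to-40-point swings reported above.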
