Metal-Sci benchmark catches LLM-generated Apple Silicon kernels that pass training but fail at scale
A new 10-task scientific compute benchmark for Apple Silicon exposes silent regressions in LLM-written Metal kernels, including a GPT 5.5 FFT that wins 2.95× in-distribution but collapses to 0.23× on held-out sizes.

Metal-Sci is a benchmark suite from Víctor Gallego that tests whether large language models can write fast scientific compute kernels for Apple Silicon's Metal API. The suite ships 10 tasks across six optimization regimes (stencils, n-body all-pairs, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE solvers, and FFT), each with a CPU reference, a roofline-anchored fitness function, and a held-out generalization size the model never sees during search. A lightweight harness runtime-compiles each candidate kernel, scores it against the roofline across multiple problem sizes, and feeds structured compile and correctness diagnostics back to a frozen LLM driving a (1+1) evolutionary loop: a single parent kernel survives, replaced only when a mutated child scores better. Single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro hardware delivered in-distribution self-speedups ranging from 1.00× (no improvement) to 10.7×.
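For concreteness, here is a minimal Swift sketch of what one such harness step could look like. The Metal runtime-compilation calls are real API; the `evaluate` function, the `CandidateResult` type, and the `scoreAgainstRoofline` hook are illustrative assumptions, not the paper's code.

```swift
import Metal

/// Outcome of one candidate evaluation: structured compiler diagnostics
/// for the LLM, or a fitness score when the kernel builds.
enum CandidateResult {
    case compileError(String)   // diagnostic text fed back to the model
    case score(Double)          // roofline-anchored fitness (assumed hook)
}

/// One harness step: runtime-compile an LLM-written kernel and score it.
/// Only the Metal calls are real API; `scoreAgainstRoofline` stands in
/// for the benchmark's fitness function.
func evaluate(kernelSource: String,
              entryPoint: String,
              scoreAgainstRoofline: (MTLFunction) -> Double) -> CandidateResult {
    guard let device = MTLCreateSystemDefaultDevice() else {
        return .compileError("no Metal device available")
    }
    do {
        // Compile the candidate from source at run time.
        let library = try device.makeLibrary(source: kernelSource,
                                             options: MTLCompileOptions())
        guard let function = library.makeFunction(name: entryPoint) else {
            return .compileError("entry point \(entryPoint) not found")
        }
        return .score(scoreAgainstRoofline(function))
    } catch {
        // Metal surfaces compiler errors here; the harness relays this
        // text to the frozen LLM as its structured diagnostic.
        return .compileError(error.localizedDescription)
    }
}
```

In a (1+1) loop this result is all the model sees between generations: a failed candidate returns its compiler diagnostics, a successful one returns its score.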
The paper's central claim is methodological: a held-out gate scoring function Φ_T — evaluated once at the end of the run on a configuration the agent never trained on — acts as a cheap mechanical oversight primitive. That gate caught an Opus template HMC kernel that returned correct samples at training dimensions but wrong samples at unseen dimensions, and a GPT FFT3D kernel that won in-distribution at 2.95× speedup but collapsed to 0.23× on a 256³ held-out cube. The preprint and code are available at github.com/vicgalle/metal-sci-kernels.
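Concretely, the gate amounts to one extra measurement after search ends. The sketch below assumes candidate and reference runners the paper does not spell out; only the shape of the check (correctness first, then speedup, at a never-trained-on size) comes from the paper.

```swift
/// Hypothetical held-out gate Φ_T: run the search champion once at a
/// configuration it never trained on. All names here are illustrative.
struct GateResult {
    let correct: Bool
    let speedup: Double
}

func heldOutGate(runCandidate: (Int) -> (output: [Float], seconds: Double),
                 runReference: (Int) -> (output: [Float], seconds: Double),
                 heldOutSize: Int,
                 tolerance: Float = 1e-4) -> GateResult {
    let ref = runReference(heldOutSize)
    let cand = runCandidate(heldOutSize)
    // Correctness first: the Opus HMC kernel fails this check, returning
    // wrong samples only at dimensions outside its training set.
    let correct = cand.output.count == ref.output.count &&
        zip(cand.output, ref.output).allSatisfy { pair in
            abs(pair.0 - pair.1) <= tolerance
        }
    // Then performance: the GPT FFT3D kernel fails here, dropping from a
    // 2.95x in-distribution win to 0.23x on the 256^3 held-out cube.
    return GateResult(correct: correct, speedup: ref.seconds / cand.seconds)
}
```

Because Φ_T is a single check run once per sweep, it adds essentially no cost to the search, which is what makes it plausible as a cheap mechanical oversight primitive.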