QASM-Eval: 4,100-task benchmark for hardware-level quantum code generation
New dataset for training and evaluating LLMs on OpenQASM-3's hardware-facing features (mid-circuit measurement, pulse control, timing) shows that current models struggle without fine-tuning.

QASM-Eval is a dataset for training and evaluating large language models on OpenQASM-3, the hardware-level programming interface for quantum computers. The benchmark comprises 100 expert-verified test tasks and 4,000 training tasks that systematically cover classical logic, timing and scheduling, pulse control, and real-world quantum workflows. These capabilities are essential for quantum error correction, dynamical decoupling, and calibration in the Noisy Intermediate-Scale Quantum (NISQ) era. Unlike existing datasets, which focus on quantum algorithm design, QASM-Eval explicitly targets OpenQASM-3's hardware-oriented features: mid-circuit measurement, classical feedback loops, precise timing control, and pulse-level waveform access.
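To make these features concrete, here is a minimal sketch of a small OpenQASM-3 program that uses mid-circuit measurement, classical feedback, and an explicit delay, wrapped in a Python helper that tags which features appear. The program is illustrative and not taken from QASM-Eval. The names `EXAMPLE_PROGRAM`, `FEATURE_PATTERNS`, and `detect_features` are hypothetical, and the regex matching is a simplification; a real tool would parse the program.

```python
import re

# Illustrative OpenQASM 3 program (an assumption for this sketch,
# not a task from QASM-Eval).
EXAMPLE_PROGRAM = """
OPENQASM 3.0;
include "stdgates.inc";

qubit[2] q;
bit c;

h q[0];
c = measure q[0];      // mid-circuit measurement
if (c) {               // classical feedback on the measured bit
  x q[1];
}
delay[100ns] q[1];     // explicit timing control
"""

# Hypothetical keyword patterns, one per hardware-facing feature.
FEATURE_PATTERNS = {
    "mid_circuit_measurement": r"=\s*measure\b",
    "classical_feedback": r"\bif\s*\(",
    "timing_control": r"\bdelay\s*\[",
    "pulse_control": r"\bdefcal\b|\bcal\s*\{",
}


def detect_features(source: str) -> set[str]:
    """Return the names of the features whose pattern occurs in the source."""
    return {
        name
        for name, pattern in FEATURE_PATTERNS.items()
        if re.search(pattern, source)
    }


if __name__ == "__main__":
    print(sorted(detect_features(EXAMPLE_PROGRAM)))
```

The example program contains no `defcal` or `cal` block, so the sketch does not tag it with pulse-level control.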
In evaluation, state-of-the-art LLMs struggled heavily with OpenQASM-3 coding tasks without task-specific training, but targeted fine-tuning on QASM-Eval yielded significant accuracy gains. The researchers also built an extended verifier that automatically validates generated programs by checking syntax correctness, quantum state evolution, and program timeline consistency. The dataset and tools are available at github.com/fuzhenxiao/QASM-Eval, which the authors present as the first standardized benchmark for hardware-facing quantum code generation.
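The summary does not say how the verifier is implemented. As a rough illustration of what a timeline consistency check could involve, the sketch below flags operations that overlap in time on the same qubit. It assumes the program has already been lowered to a list of scheduled operations. `ScheduledOp` and `find_timeline_conflicts` are hypothetical names, not part of the QASM-Eval tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledOp:
    """One operation placed on a qubit's timeline (hypothetical structure)."""
    name: str
    qubit: int
    start_ns: int
    duration_ns: int

    @property
    def end_ns(self) -> int:
        return self.start_ns + self.duration_ns


def find_timeline_conflicts(ops: list[ScheduledOp]) -> list[tuple[str, str]]:
    """Return (earlier, later) name pairs that overlap on the same qubit."""
    by_qubit: dict[int, list[ScheduledOp]] = {}
    for op in ops:
        by_qubit.setdefault(op.qubit, []).append(op)

    conflicts = []
    for qubit_ops in by_qubit.values():
        ordered = sorted(qubit_ops, key=lambda op: op.start_ns)
        latest = ordered[0]  # operation with the latest end time seen so far
        for op in ordered[1:]:
            # An operation that starts before the latest end time overlaps.
            if op.start_ns < latest.end_ns:
                conflicts.append((latest.name, op.name))
            if op.end_ns > latest.end_ns:
                latest = op
    return conflicts


if __name__ == "__main__":
    schedule = [
        ScheduledOp("h", qubit=0, start_ns=0, duration_ns=50),
        ScheduledOp("measure", qubit=0, start_ns=50, duration_ns=1000),
        ScheduledOp("x", qubit=1, start_ns=0, duration_ns=50),
        ScheduledOp("delay", qubit=1, start_ns=40, duration_ns=100),
    ]
    print(find_timeline_conflicts(schedule))
```

In this toy schedule, the `delay` on qubit 1 starts before the preceding `x` gate has finished, so that pair is reported. The back-to-back operations on qubit 0 are accepted.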

