Custom llama.cpp build unlocks flash attention on RDNA2, hitting 70–80 tok/s on Qwen 3.6
A developer bypassed a crash-inducing assert in llama.cpp's ROCm backend to enable flash attention on AMD RDNA2 GPUs, pushing Qwen 3.6 35B from unrunnable under stock ROCm to 70–80 tokens per second.
A custom llama.cpp binary released this week more than doubles inference speed on AMD RDNA2 GPUs by working around a flash attention crash that blocks stock ROCm builds. The patched release, available on GitHub, targets the gfx1030 and gfx1031 architectures in the Radeon RX 6000 series and reports 70–80 tokens per second on quantized Qwen 3.6 35B models. That compares with roughly 30 tok/s under Vulkan and no output at all under stock ROCm, which hits an assertion failure before inference starts.
The crash stems from hipOccupancyMaxActiveBlocksPerMultiprocessor returning zero on RDNA2 hardware, which triggers an assert in llama.cpp's CUDA-derived flash attention kernel at ggml-cuda/fattn-common.cuh:1054. The workaround patches out the assert and logs the result of the occupancy query instead of halting. The kernel then runs, which indicates the GPU has enough resources to execute it despite the zero reading. The release includes a modified CMake build targeting gfx1030/gfx1031 with the -DGGML_FATTN_TRACE flag and ships a precompiled llama-server binary configured for multi-token prediction draft mode, 64k context, and 50-layer GPU offload.
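The control-flow change behind the workaround can be modeled in a few lines. The Python sketch below is an illustration only, not the patch itself, which lives in llama.cpp's C++/HIP code. The function names and the fallback value of 1 are hypothetical, since the release notes do not say what value, if any, the patched build substitutes for the zero reading.

```python
import logging

logger = logging.getLogger("fattn_sketch")


def check_occupancy_strict(max_blocks_per_sm: int) -> int:
    """Model of the stock behaviour: a zero occupancy result aborts the run."""
    if max_blocks_per_sm <= 0:
        # Stands in for the assert that fires before inference starts.
        raise AssertionError("occupancy query returned 0 blocks per multiprocessor")
    return max_blocks_per_sm


def check_occupancy_lenient(max_blocks_per_sm: int, fallback: int = 1) -> int:
    """Model of the workaround: log the suspicious value and keep going.

    The fallback of 1 is an assumption made for this sketch.
    """
    if max_blocks_per_sm <= 0:
        logger.warning(
            "occupancy query returned %d; continuing with fallback of %d",
            max_blocks_per_sm,
            fallback,
        )
        return fallback
    return max_blocks_per_sm
```

In this model, passing 0 to check_occupancy_strict raises, while check_occupancy_lenient logs a warning and returns the fallback. The trade-off is the one the developer flags: any code that depends on the reported count being accurate now receives a guess.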
Stability remains uneven. Gemma crashes on larger contexts and DeepSeek runs slowly; confirmed working inference is limited to Qwen 3.6 27B and 35B. The patch is explicitly a workaround: ROCm's occupancy reporting bug persists upstream, and the fix may break on models that rely on accurate blocks-per-multiprocessor counts. If AMD or the llama.cpp maintainers address the underlying HIP query issue, flash attention could work in mainline builds across the RX 6000 lineup, with no custom binaries required.
