
ROCm 7.13 and MTP recover Strix Halo full-context speed, Vulkan stays stable | UncensoredHub



AMD's ROCm 7.13 nightlies now compile shaders on Strix Halo, and llama.cpp's newly merged Multi-Token Prediction recovers much of the 64% full-context decode penalty on a 35B MoE, while Vulkan holds steady with only a 12% drop.

May 16, 2026


ROCm 7.13 nightlies now compile shaders on gfx1151, the GPU in AMD's Strix Halo APU, making them the first ROCm builds to run stably on that chip. llama.cpp merged Multi-Token Prediction (MTP) to mainline on May 16. Full-context benchmarks on a 35B MoE show ROCm's decode speed collapsing 64 percent at 76k prompt tokens without MTP. With MTP enabled, the drop shrinks to about 41 percent (63.7 to 37.5 tok/s), and full-context throughput more than doubles, from 16.6 to 37.5 tok/s. Vulkan, by contrast, drops no more than about 12 percent at full context whether MTP is on or off.

The tests ran three models (a 35B mixture-of-experts, a 122B MoE, and a 27B dense model) across the ROCm 7.13 and Vulkan 1.3 RADV backends, with and without MTP, at three prompt lengths plus a full-context decode (76k prompt tokens, 5k output). The 35B MoE on ROCm without MTP fell from 46.2 tok/s at empty context to 16.6 tok/s at full context. Switching on MTP brought it back to 37.5 tok/s, still well below the 63.7 tok/s that MTP reaches at empty context, but usable. Vulkan without MTP on the same model managed 28.9 tok/s at full context, down from 32.7 tok/s empty, and Vulkan with MTP reached 34.3 tok/s.
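The percentage drops follow directly from the empty-context and full-context decode speeds. As a minimal sketch, the snippet below recomputes them for the 35B MoE from the figures quoted above. The helper name `percent_drop` and the dictionary keys are illustrative, not part of any benchmark tooling, and each drop is measured against that configuration's own empty-context speed.

```python
def percent_drop(empty_tps: float, full_tps: float) -> float:
    """Relative decode slowdown from empty to full context, in percent."""
    return (empty_tps - full_tps) / empty_tps * 100.0


# 35B MoE decode speeds quoted in the article: (empty-context, full-context) tok/s.
# Vulkan with MTP is omitted because its empty-context speed is not reported.
runs_35b = {
    "rocm_no_mtp": (46.2, 16.6),
    "rocm_mtp": (63.7, 37.5),
    "vulkan_no_mtp": (32.7, 28.9),
}

drops_35b = {
    name: round(percent_drop(empty, full), 1)
    for name, (empty, full) in runs_35b.items()
}

for name, drop in drops_35b.items():
    print(f"{name}: {drop}% slower at full context")
```

Measuring the MTP run against its own 63.7 tok/s baseline is a choice. Measured against the non-MTP empty-context speed of 46.2 tok/s instead, the same 37.5 tok/s would read as a smaller loss.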

Full-context performance across model sizes

The 122B MoE flipped the script. Vulkan without MTP dropped only 12 percent at full context, to 23.7 tok/s, while ROCm with MTP fell 38 percent to 19.2 tok/s. Vulkan with MTP lost the least, just 6 percent, but still landed slightly below non-MTP Vulkan in absolute terms at 21.9 tok/s. The dense 27B model proved unusable on both backends, bottoming out at 6–9 tok/s because it processes 27 billion active parameters per token instead of the MoE's 3 billion.
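The 122B results are given only as a full-context speed and a percentage drop, so the empty-context speeds have to be backed out. The sketch below does that with `empty = full / (1 - drop)`. The outputs are implied estimates rather than measured values, and because the percentages are rounded they are only accurate to roughly half a tok/s. The function and key names are illustrative.

```python
def implied_empty_tps(full_tps: float, drop_percent: float) -> float:
    """Empty-context speed implied by a full-context speed and its percent drop."""
    return full_tps / (1.0 - drop_percent / 100.0)


# 122B MoE figures quoted in the article: (full-context tok/s, percent drop).
runs_122b = {
    "vulkan_no_mtp": (23.7, 12.0),
    "rocm_mtp": (19.2, 38.0),
    "vulkan_mtp": (21.9, 6.0),
}

implied_122b = {
    name: round(implied_empty_tps(full, drop), 1)
    for name, (full, drop) in runs_122b.items()
}

for name, empty in implied_122b.items():
    print(f"{name}: about {empty} tok/s implied at empty context")
```

Read this way, the quoted figures suggest ROCm with MTP starts out fastest on the 122B and gives up that lead as the context fills.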

ROCm led Vulkan by about 1.4× at empty context (46 vs 32 tok/s on the 35B without MTP), but that advantage largely disappeared at full context once the memory subsystem saturated: without MTP, ROCm fell behind Vulkan (16.6 vs 28.9 tok/s), and with MTP it led by only about 1.1× (37.5 vs 34.3 tok/s). The setup used ROCm 7.13 with the therock-gfx1151 codegen path, Vulkan 1.3 RADV, and llama.cpp build 9188. BF16 models failed at full context on Strix Halo; the benchmarks ran Q8 quantization for the 35B and Q4 for the 122B. For production use on a sub-100W budget, ROCm with MTP on the 35B MoE delivers 37.5 tok/s at the tested full context (76k prompt tokens). Quality-focused users running the 122B on Vulkan get 23–24 tok/s with minimal context penalty.
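The backend comparison can be recomputed from the 35B decode figures quoted in this article. A minimal sketch, with illustrative names, that expresses each pairing as a ROCm-to-Vulkan speed ratio (values above 1 favor ROCm, values below 1 favor Vulkan):

```python
def speed_ratio(rocm_tps: float, vulkan_tps: float) -> float:
    """ROCm decode speed as a multiple of Vulkan decode speed."""
    return rocm_tps / vulkan_tps


# 35B MoE decode speeds quoted in the article: (ROCm tok/s, Vulkan tok/s).
pairs_35b = {
    "empty_no_mtp": (46.2, 32.7),
    "full_no_mtp": (16.6, 28.9),
    "full_mtp": (37.5, 34.3),
}

ratios_35b = {
    name: round(speed_ratio(rocm, vulkan), 2)
    for name, (rocm, vulkan) in pairs_35b.items()
}

for name, ratio in ratios_35b.items():
    leader = "ROCm" if ratio > 1.0 else "Vulkan"
    print(f"{name}: ROCm/Vulkan = {ratio} ({leader} ahead)")
```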

