llama.cpp ROCm backend consumes 15% more VRAM than Vulkan on identical workloads
A user running llama.cpp in Docker reports ROCm consuming 29.1 GB VRAM versus 25.3 GB on Vulkan for the same 22.6 GB model with Q8_0 KV cache quantization, with no performance gain.

"ROCm is eating nearly 4 GB more VRAM than Vulkan for the same model and no speed benefit," a user running llama.cpp in a Docker stack observed after testing both backends on identical workloads.
The comparison loaded a 22.6 GB model file with the same context length and Q8_0 KV cache quantization on both backends. ROCm held 29.1 GB of VRAM at idle (no prompt, no existing context, no system message), while Vulkan held steady at 25.3 GB. The 3.8 GB gap, about 15 percent, appeared immediately after model load, before any inference began.
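The arithmetic behind those figures can be checked in a few lines. The sketch below uses only the three numbers from the report; the variable and function names are illustrative, and the "overhead above the model file" values are simple subtractions, not measurements of any specific buffer.

```python
# Figures reported by the user, in GB.
MODEL_GB = 22.6
ROCM_GB = 29.1
VULKAN_GB = 25.3


def overhead(used_gb, baseline_gb):
    """Return (extra GB, extra percent) of used_gb relative to baseline_gb."""
    extra = used_gb - baseline_gb
    return extra, 100.0 * extra / baseline_gb


# ROCm relative to Vulkan: the headline gap.
extra_gb, extra_pct = overhead(ROCM_GB, VULKAN_GB)

# Each backend relative to the model file itself (KV cache, buffers, and so on).
rocm_over_model, _ = overhead(ROCM_GB, MODEL_GB)
vulkan_over_model, _ = overhead(VULKAN_GB, MODEL_GB)

print(f"ROCm vs Vulkan: +{extra_gb:.1f} GB ({extra_pct:.0f}%)")
print(f"Above model file: ROCm +{rocm_over_model:.1f} GB, Vulkan +{vulkan_over_model:.1f} GB")
```

Seen this way, ROCm sits about 6.5 GB above the model file while Vulkan sits about 2.7 GB above it, so the gap lies entirely in what each backend allocates beyond the weights.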
llama.cpp's ROCm backend uses AMD's HIP runtime to target AMD GPUs, while the Vulkan path uses the cross-platform Vulkan graphics and compute API. Both support the same quantization schemes and context sizes, but memory layout and driver behavior differ under the hood. One possible explanation is that the ROCm path allocates larger scratch buffers or aligns tensors less tightly than Vulkan. For practitioners running multi-service AI stacks on AMD hardware, the Vulkan backend may remain the leaner choice until ROCm's memory footprint improves.
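Readers who want to repeat the comparison on their own hardware need a measurement that does not depend on either backend. One option on Linux is the amdgpu driver's sysfs counters, `mem_info_vram_used` and `mem_info_vram_total`, which report bytes. The sketch below is a minimal example under assumptions: the card path `/sys/class/drm/card0/device` is hypothetical and varies by system, the helper name is illustrative, and dividing by 10^9 is a guess at the unit convention, since the report does not say whether its figures are GB or GiB. It is not necessarily how the user took the original readings.

```python
from pathlib import Path


def read_vram_gb(device_dir):
    """Return (used_gb, total_gb) from amdgpu sysfs counters, which report bytes."""
    base = Path(device_dir)
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    # Assumption: decimal gigabytes; use 2**30 instead for GiB.
    return used / 1e9, total / 1e9


if __name__ == "__main__":
    # Hypothetical card index; check /sys/class/drm/ for the right one.
    card = "/sys/class/drm/card0/device"
    if Path(card, "mem_info_vram_used").exists():
        used_gb, total_gb = read_vram_gb(card)
        print(f"VRAM in use: {used_gb:.1f} GB of {total_gb:.1f} GB")
    else:
        print("amdgpu VRAM counters not found at", card)
```

Taking one reading after model load under each backend, with the same model, context length, and KV cache settings, gives numbers comparable to those in the report.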