WhichLLM recommends 20B+ models that exceed VRAM on low-resource hardware

A developer testing WhichLLM for internal tooling found the utility recommending 20B and 27B parameter models on machines with 4–6GB VRAM, raising questions about the tool's accuracy for resource-constrained deployments.

May 18, 2026

A developer building internal CLI tools on work laptops with 4–6GB of VRAM turned to WhichLLM to identify which models would run locally, and found recommendations that didn't match reality. The tool flagged both a 20B parameter model and Qwen 3.6 27B as viable candidates, even though the only model the developer had actually managed to run was Qwen 2.5 Coder Instruct 3B. The discrepancy surfaced while the team explored alternatives to Ollama for developer tooling, marketing automation, and factory production systems.

WhichLLM appears to calculate model compatibility from theoretical memory footprints (parameter count multiplied by quantization bit depth) without accounting for context window overhead, KV cache growth, or the gap between what it takes to load a model and what it takes to run inference. A 20B model quantized to 4-bit still occupies roughly 10GB on disk (20 billion parameters at half a byte each) and requires additional VRAM for activations during inference, making it a poor fit for 4–6GB cards. The tool's RAM and disk capacity readings also did not match the underlying system, likely because the developer was running Linux inside WSL2, which presents virtualized resource limits to userspace utilities.
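The arithmetic behind that gap can be sketched in a few lines of Python. This is a rough illustration, not WhichLLM's actual logic: the function names are made up for this example, the layer count, KV head count, head dimension, and context length are hypothetical values for a generic 20B model, and the 10% overhead for activations and runtime buffers is an assumed placeholder.

```python
GB = 1e9  # decimal gigabytes, matching the "roughly 10GB" figure above


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Static footprint: parameter count times quantization bit depth."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache: one key and one value vector per layer, per KV head, per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


def estimate_vram_gb(n_params: float, bits_per_weight: float, n_layers: int,
                     n_kv_heads: int, head_dim: int, context_len: int,
                     overhead_fraction: float = 0.10) -> float:
    """Weights plus KV cache, padded by an ASSUMED runtime overhead fraction."""
    weights = weight_bytes(n_params, bits_per_weight)
    kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
    return (weights + kv) * (1 + overhead_fraction) / GB


# Hypothetical architecture for a 20B model; real models will differ.
weights_only_gb = weight_bytes(20e9, 4) / GB
estimate_gb = estimate_vram_gb(20e9, 4, n_layers=48, n_kv_heads=8,
                               head_dim=128, context_len=8192)

print(f"weights only:               {weights_only_gb:.1f} GB")
print(f"with KV cache and overhead: {estimate_gb:.1f} GB")
print(f"fits in 6 GB of VRAM:       {estimate_gb <= 6}")
```

Under these assumptions the weights alone come to 10GB and the padded estimate to roughly 12.8GB, about double the largest card in question. A checker that stops at the first number already has enough information to reject the model, and the KV cache term only grows with longer prompts or larger batches.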

The gap between WhichLLM's output and real-world constraints highlights a broader challenge for practitioners selecting models for resource-constrained deployments. Static calculators that ignore runtime memory pressure, batch size, and prompt length tend to overestimate what fits on consumer GPUs. Tools that profile actual inference runs, measuring peak VRAM during a representative workload, remain more reliable than disk-space heuristics, especially when quantization formats like GGUF or AWQ introduce variable memory behavior across backends. Until static calculators account for these runtime factors, developers sizing models for low-VRAM hardware should benchmark candidates directly rather than relying on static compatibility checks.
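One way to benchmark directly is to poll GPU memory while a representative prompt runs and keep the highest reading. The sketch below assumes an NVIDIA GPU with `nvidia-smi` on the PATH. The helper names `read_used_mib_nvidia_smi` and `peak_during` are hypothetical, and polling at a fixed interval can miss very short spikes, so treat the result as a lower bound on true peak usage.

```python
import subprocess
import threading
from typing import Callable


def read_used_mib_nvidia_smi() -> int:
    """Return used memory in MiB for the first GPU listed by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def peak_during(workload: Callable[[], None],
                read_used_mib: Callable[[], int],
                interval_s: float = 0.05) -> int:
    """Run `workload` while polling `read_used_mib`; return the highest reading."""
    peak = read_used_mib()  # baseline before the workload starts
    stop = threading.Event()

    def poll() -> None:
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, read_used_mib())
            stop.wait(interval_s)  # sleep, but wake early once stop is set

    poller = threading.Thread(target=poll, daemon=True)
    poller.start()
    try:
        workload()
    finally:
        stop.set()
        poller.join()
    # One last reading after the poller has exited, so there is no race on `peak`.
    return max(peak, read_used_mib())
```

In use, `workload` would be whatever function loads the candidate model and runs a typical prompt through the chosen backend, for example `peak_during(run_inference, read_used_mib_nvidia_smi)`, where `run_inference` is a placeholder for that call. Passing the reader as a parameter keeps the sampler independent of any one vendor tool.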
