Qwen 3.5B outperforms Gemma 4 27B in agentic coding on dual-GPU setup
A user reports that Qwen 3.5B, running at Q8_0 quantization on an RTX 4090 and RTX 5060 Ti, delivers stronger agentic coding performance than Gemma 4 27B, particularly when Claude Code is pointed at a local llama.cpp server.

Qwen 3.5B, Alibaba's compact instruction-tuned model, is showing unexpected strength in agentic coding workflows when run locally at Q8_0 quantization. According to one user's report, the model, paired with Claude Code and a local llama.cpp backend, outperforms Google's Gemma 4 27B on demo and data analytics tasks despite its smaller parameter count.
The setup uses an RTX 4090 and an RTX 5060 Ti to hold the model and a KV cache sized for 262,144 tokens of context, with both the weights and the cache quantized to Q8_0. Qwen 3.5B delivers cleaner results in agentic mode, where it iterates on code through a tool-calling interface, than in direct chat, where its output tends to be clunkier. The model hasn't been tested on large codebases yet, but the early results suggest it punches above its weight class on structured coding tasks.
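For a rough sense of what a Q8_0 KV cache at that context length costs in memory, the arithmetic can be sketched as below. The Q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block) is ggml's; the layer count, KV-head count, and head dimension are placeholder values, not Qwen 3.5B's actual architecture, and the formula assumes every layer uses standard attention with a full-length cache.

```python
# Q8_0 in ggml stores values in blocks of 32: 32 int8 values + one fp16 scale = 34 bytes.
Q8_0_BYTES_PER_VALUE = 34 / 32  # 1.0625 bytes, i.e. 8.5 bits per value


def q8_0_bytes(n_values: int) -> float:
    """Storage needed for n_values numbers in Q8_0 format."""
    return n_values * Q8_0_BYTES_PER_VALUE


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int) -> float:
    """Estimate KV cache size when both K and V are stored as Q8_0.

    Each layer keeps one K and one V tensor, each holding
    n_ctx * n_kv_heads * head_dim values.
    """
    values_per_layer = 2 * n_ctx * n_kv_heads * head_dim
    return q8_0_bytes(n_layers * values_per_layer)


if __name__ == "__main__":
    # n_ctx comes from the article; the other three numbers are HYPOTHETICAL
    # placeholders and should be replaced with the real model's values.
    size = kv_cache_bytes(n_ctx=262_144, n_layers=32, n_kv_heads=4, head_dim=128)
    print(f"KV cache: {size / 2**30:.2f} GiB")
```

With these placeholder values the cache alone comes to 8.5 GiB, which illustrates why context length, and not only weight size, determines how the load is spread across the two cards.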
In agentic workflows, the model works through a structured tool-calling interface: the code it returns is executed in a sandboxed environment, and any errors are fed back for another iteration. That loop appears to play to Qwen 3.5B's strengths in a way that open-ended chat prompts don't. The model's instruction-following appears to improve when it is constrained by a tool schema rather than free-form conversation.
Qwen 3.5B runs on consumer hardware with 24GB VRAM when quantized to Q8_0, making it accessible to hobbyists and small teams who can't afford multi-GPU clusters or API costs.
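The generate, execute, and feed-back-errors loop described above can be sketched in a few lines. This is a minimal illustration, not how Claude Code or llama.cpp implement it: `agent_loop`, `run_code`, and `fake_model` are hypothetical names, the model is a stub standing in for a real backend call, and a plain subprocess is used in place of a real sandbox.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run code in a separate Python process; return (succeeded, output).

    NOTE: a subprocess is NOT a security sandbox; it only isolates crashes.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    finally:
        os.unlink(path)


def agent_loop(
    generate: Callable[[str, Optional[str]], str], task: str, max_iters: int = 5
) -> Optional[Tuple[str, str, int]]:
    """Ask for code, execute it, and feed errors back until it runs cleanly.

    Returns (code, output, attempts) on success, or None if max_iters is hit.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        code = generate(task, feedback)
        ok, output = run_code(code)
        if ok:
            return code, output, attempt
        feedback = output  # the error text becomes context for the next try
    return None


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Stub model: the first attempt has a bug, the retry fixes it."""
    if feedback is None:
        return "print(undefined_name)"
    return "print(sum(range(5)))"


if __name__ == "__main__":
    print(agent_loop(fake_model, "sum the integers 0..4"))
```

In a real setup, `generate` would call the local server and the error text would be returned to the model as a tool result; the stub only shows the control flow.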