Qwen 3.6 27B outperforms GPT and Gemini on HTML canvas animation coding
A local-LLM practitioner benchmarked Qwen 3.6 quantized models against Claude, GPT, and Gemini on a dense HTML5 canvas animation prompt, finding the 27B Q4_K_M build ranked second overall for realistic parallax driving scenes.
A practitioner tested Qwen 3.6 quantized models against frontier APIs on a coding primitive this week: generating a single-file HTML canvas animation of a side-view car driving through layered parallax scenery. The test prompt asked for spinning wheels, subtle chassis motion, multi-speed background layers, and cinematic lighting, all in vanilla JavaScript with no libraries. The same prompt was run through Claude Sonnet 4.6 Thinking, Gemini 3.1 Pro Thinking, GPT 5.4 Thinking, and Kimi k2.6 Thinking via Perplexity, and then through seven local Qwen and Gemma quantizations running on a Ryzen 5 5600 with 24 GB of DDR4-3200 and an 8 GB RX 5700 XT.
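For readers unfamiliar with what the prompt demands, the core motion math can be sketched in a few pure functions. This is only an illustrative sketch, not code from the post or from any model's output: the `Layer` shape, the function names, and every speed, width, and amplitude value are assumptions chosen for the example.

```typescript
// Hypothetical layer description; names and numbers are illustrative only.
interface Layer {
  label: string;
  speedFactor: number; // 0 = static backdrop, 1 = moves at road speed
  tileWidth: number;   // width in pixels after which the layer repeats
}

// Horizontal scroll offset of a layer once the car has covered `distance` pixels.
// Wrapped into [0, tileWidth) so the layer can be tiled seamlessly.
function layerOffset(layer: Layer, distance: number): number {
  const raw = (distance * layer.speedFactor) % layer.tileWidth;
  return raw < 0 ? raw + layer.tileWidth : raw;
}

// Rotation in radians of a wheel rolling without slipping: angle = distance / radius.
function wheelAngle(distance: number, radius: number): number {
  return distance / radius;
}

// Small vertical chassis bob modelled as a sine wave (amplitude in pixels).
function chassisBob(timeSeconds: number, amplitude: number = 1.5, hz: number = 2): number {
  return amplitude * Math.sin(2 * Math.PI * hz * timeSeconds);
}

// Assumed scene: distant layers scroll slower than near ones.
const sceneLayers: Layer[] = [
  { label: "mountains", speedFactor: 0.1, tileWidth: 1200 },
  { label: "trees", speedFactor: 0.5, tileWidth: 800 },
  { label: "road", speedFactor: 1.0, tileWidth: 200 },
];

const travelled = 1000; // pixels covered so far (arbitrary example value)
sceneLayers.forEach((layer) => {
  console.log(layer.label, layerOffset(layer, travelled));
});
console.log("wheel angle (rad):", wheelAngle(travelled, 20));
```

In a browser, values like these would typically be recomputed inside a `requestAnimationFrame` loop and applied with canvas 2D calls such as `ctx.translate` and `ctx.rotate`; the sketch leaves the drawing out so the math stays visible.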
Kimi k2.6 Thinking ranked first for visual polish, but Qwen 3.6 27B Q4_K_M—running locally at 2.70 tokens per second—placed second, ahead of both the Claude-distilled 27B variant and the frontier Gemini and GPT entries. The tester noted the local 27B build delivered stronger parallax layering and road feel than expected, producing output that held its own against models accessed through paid API subscriptions.
Inference speeds and hardware
All local models ran on the same consumer hardware: a six-core Ryzen 5 5600, 24 GB of DDR4-3200 RAM, and an 8 GB RX 5700 XT. The 27B Q4_K_M models achieved 2.65–2.70 tok/s and Gemma-4-31b-it managed 1.91 tok/s, while the 35B A3B quant reached 12.13 tok/s, the 9B build hit 50 tok/s, and the 4B Q4_K_M variant peaked at 80 tok/s. Some 4B runs had reasoning enabled, though the 27B builds did not.
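To put those throughput figures in practical terms, the following sketch converts tokens per second into wall-clock generation time. The rates are the ones reported above; the 4,000-token response length is purely an assumption for illustration, since the post does not state how long each model's output was, and the function names are hypothetical.

```typescript
// Throughput figures (tokens per second) as reported in the post.
const reportedRates: Array<[string, number]> = [
  ["Gemma-4-31b-it", 1.91],
  ["Qwen 3.6 27B Q4_K_M", 2.70],
  ["35B A3B quant", 12.13],
  ["9B build", 50],
  ["4B Q4_K_M", 80],
];

// ASSUMPTION: hypothetical response length; the post gives no token counts.
const ASSUMED_OUTPUT_TOKENS = 4000;

// Seconds needed to emit `outputTokens` at a steady `tokPerSec`.
function secondsToGenerate(outputTokens: number, tokPerSec: number): number {
  if (tokPerSec <= 0) {
    throw new RangeError("tokPerSec must be positive");
  }
  return outputTokens / tokPerSec;
}

// Formats a duration as "Xm Ys", rounding to the nearest whole second.
function formatDuration(seconds: number): string {
  const total = Math.round(seconds);
  const minutes = Math.floor(total / 60);
  return `${minutes}m ${total % 60}s`;
}

reportedRates.forEach(([model, rate]) => {
  const secs = secondsToGenerate(ASSUMED_OUTPUT_TOKENS, rate);
  console.log(`${model}: ${formatDuration(secs)}`);
});
```

Under that assumed length, the 27B build at 2.70 tok/s would need roughly 25 minutes per attempt, against 50 seconds for the 4B variant at 80 tok/s. Prompt processing time and any reasoning tokens are ignored here, so real runs would take longer.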
The post included side-by-side GIFs of each model's output, showing looping canvas animations with varying degrees of wheel rotation smoothness, parallax depth, and lighting cohesion. The author emphasized that this was a subjective ranking for a narrow task (a realistic driving animation in a single HTML file) and that results would likely shift for other coding primitives. Within those limits, the comparison suggests that mid-sized open-weight models quantized to 4-bit can compete with frontier APIs on dense, visually demanding code generation when run on accessible consumer hardware.
