Qwen 3.6 35B sustains 200k tokens on Q4_0 KV quantization in production coding
A developer running Qwen 3.6 models on 32GB AMD hardware reports stable performance up to 200k tokens using Q4_0 KV quantization, with degradation only appearing near the upper limit.
A developer stress-testing Qwen 3.6 models on a 20-file codebase reports that Q4_0 KV cache quantization held up cleanly to roughly 200,000 tokens before hitting performance limits. The test ran on a 32GB AMD GPU using llama.cpp with the Vulkan backend, alternating between the Qwen 3.6 27B dense and 35B MoE checkpoints for multi-file coding tasks.
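The report does not include the developer's launch command, so the following is only a minimal sketch of how a llama.cpp server might be started with a Q4_0 KV cache. The model filename, the 250,000-token context size, and the helper name `build_llama_server_cmd` are assumptions for illustration. The Vulkan backend is chosen when llama.cpp is built rather than by a flag shown here. A quantized V cache has typically also needed flash attention enabled; the spelling of that option varies across llama.cpp versions, so it is left out.

```python
from __future__ import annotations

import shlex


def build_llama_server_cmd(model_path: str, ctx_tokens: int, kv_type: str = "q4_0") -> list[str]:
    """Assemble a llama-server argument list with a quantized KV cache.

    model_path and ctx_tokens are placeholders, not values from the report.
    """
    return [
        "llama-server",
        "-m", model_path,            # GGUF model file (hypothetical name below)
        "-c", str(ctx_tokens),       # context window size in tokens
        "-ngl", "99",                # offload as many layers as possible to the GPU
        "--cache-type-k", kv_type,   # quantization type for the K cache
        "--cache-type-v", kv_type,   # quantization type for the V cache
    ]


cmd = build_llama_server_cmd("qwen-model.gguf", 250_000)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value paired, which makes it easy to pass to `subprocess.run` or to swap `kv_type` for `q8_0` or `f16` when comparing cache settings.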
The developer let a single chat thread run unchecked to surface any context-handling errors or logic loops. The session stayed stable and mistake-free through the sub-100k range, where most practical work happens. Around 200k tokens the model slowed noticeably and began dropping API calls, though it remained technically functional. The developer concluded that sub-100k context windows are the sweet spot for real work, but called the 200k ceiling "fucking impressive" for an open-weight model running locally.
What stands out
- Q4_0 KV quantization is credited with halving VRAM usage, with no reported quality drop below 100k tokens. The developer saw no mistakes or logic loops in a complex multi-file codebase until approaching the 200k mark.
- Degradation appears only near the upper context limit. Near 200k tokens (the developer reports 73k of space remaining in a 250k buffer), the model slowed and began failing API calls. Below that threshold, performance was stable.
- Qwen 3.6 35B MoE handled 20+ files without context collapse. The developer ran the test on a real codebase, not synthetic benchmarks, and the model maintained coherence across files through the practical working range.
- The AMD Vulkan backend on llama.cpp is production-viable for 32GB cards. The developer runs this stack daily and reports consistent results with both dense and MoE Qwen variants.
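To put the memory claim in context, here is a rough sketch of KV cache arithmetic for different cache types. The per-element sizes follow ggml's block layouts (32 values per block plus a 2-byte fp16 scale). The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen 3.6's actual architecture, and the formula assumes every layer keeps a full attention cache.

```python
# Bytes per stored element for ggml cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values (32 bytes) + 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + 2-byte scale per block
}


def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in GiB for a given context length and cache type."""
    # Factor of 2 covers both the K and the V tensors.
    elements = 2 * n_tokens * n_layers * n_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical architecture values, chosen only to make the arithmetic concrete.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(200_000, N_LAYERS, N_KV_HEADS, HEAD_DIM, cache_type)
    print(f"{cache_type}: {size:.1f} GiB at 200k tokens")
```

Per element, Q4_0 takes about 28% of the space of F16 and about 53% of Q8_0, so a "halving" matches a Q8_0 baseline most closely. The report does not state which baseline the developer compared against.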
