Qwen2.5-Coder and Qwen3.6-35B stack fits dual coding agents on RTX 5080
A developer configured autocomplete and agentic coding on a single RTX 5080, pairing Qwen2.5-Coder-7B at Q6_K_L for infill with Qwen3.6-35B-A3B at Q8_K_XL for multi-step tasks; 64GB of system RAM is the recommended minimum.

A developer shared a working dual-model coding setup on a single RTX 5080 (16GB VRAM) with 96GB of system RAM. The configuration pairs Qwen2.5-Coder-7B-Instruct at Q6_K_L quantization for autocomplete with Qwen3.6-35B-A3B at Q8_K_XL for agentic tasks. The 7B model consumes roughly 8GB of VRAM and delivers near-instant autocomplete suggestions; the 35B-A3B model fits in the remaining 8GB thanks to its 3B active parameter count, with MoE expert weights offloaded to system RAM via llama.cpp's --cpu-moe flag.
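The setup can be sketched as two llama-server instances, a minimal sketch under assumed file paths, ports, and context size (none of these specifics come from the post; only the quantizations and the --cpu-moe flag do):

```shell
# Autocomplete: Qwen2.5-Coder-7B at Q6_K_L, fully on GPU (~8GB VRAM).
# Model path and port are hypothetical.
llama-server \
  -m ~/models/Qwen2.5-Coder-7B-Instruct-Q6_K_L.gguf \
  -ngl 99 \
  --port 8081 &

# Agentic tasks: 35B-A3B at Q8_K_XL. --cpu-moe keeps the MoE expert
# weights in system RAM while attention and shared layers use the
# remaining VRAM. Context size here is illustrative.
llama-server \
  -m ~/models/Qwen3.6-35B-A3B-Q8_K_XL.gguf \
  -ngl 99 \
  --cpu-moe \
  -c 145000 \
  --port 8082 &
```

An IDE plugin can then point its infill endpoint at port 8081 and its agent at port 8082, keeping the two workloads fully independent.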
Q8 quantization is mandatory for reliable agentic behavior on the 35B-A3B model. At Q4, the model "gets lost a lot" and fails to complete multi-step tasks. Q6_K is a fallback for systems with less RAM, but lower quantizations show noticeable quality degradation. The user recommends 64GB of system RAM as a minimum; their own 96GB setup leaves headroom, with total memory usage sitting at 56GB while both models run alongside a browser, an IDE, and Teams.

Qwen2.5-Coder remained the preferred infill model after testing Gemma4 E4B and Qwen3.5 variants, both of which produced "weird suggestions." At Q8_K_XL, the 35B-A3B model delivers 35.29 tokens per second for generation and 2093.93 tokens per second for prompt processing, with context autofitting to roughly 145k tokens.
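The memory split can be sanity-checked with back-of-envelope arithmetic. This is a sketch using approximate bytes-per-parameter figures for common GGUF quantizations (the constants are rough averages I am assuming, not values from the post):

```python
# Approximate weight sizes for common llama.cpp quantizations,
# in bytes per parameter. Rough averages, not exact GGUF numbers.
BYTES_PER_PARAM = {"Q4_K": 0.56, "Q6_K": 0.82, "Q8_0": 1.06}

def model_gb(params_b: float, quant: str) -> float:
    """Approximate weight footprint in GB for params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[quant]

# 7B autocomplete model at ~Q6: weights alone fit easily in VRAM;
# KV cache and overhead push real usage toward the reported ~8GB.
autocomplete = model_gb(7, "Q6_K")

# 35B-A3B at ~Q8: total weights far exceed 16GB of VRAM, which is why
# --cpu-moe offloads the expert weights to system RAM. Only the ~3B
# active parameters are read per token, keeping generation fast.
agentic_total = model_gb(35, "Q8_0")
active_per_token = model_gb(3, "Q8_0")

print(f"autocomplete ~{autocomplete:.1f}GB, "
      f"agentic total ~{agentic_total:.1f}GB, "
      f"active per token ~{active_per_token:.1f}GB")
```

The arithmetic shows why the configuration works: the 35B model's full weights could never share 16GB of VRAM with the 7B model, but the per-token active slice is small enough that expert reads from system RAM still sustain the reported ~35 tokens per second.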