Local developer builds single-slot sub-agent fork for 10GB VRAM setups
A new GitHub repo lets VRAM-constrained users run sub-agents through pi coding agent without reprocessing prompts, targeting single-slot llama.cpp servers with 10GB VRAM.
Local-first developers have long struggled to run multi-agent workflows on constrained hardware. A new fork of the pi-subagent repository addresses that problem: it enables sub-agent task splitting on a single KV cache slot, without reprocessing the full prompt after each sub-agent completes.
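To give a rough sense of why avoiding reprocessing matters, the sketch below estimates how long one full prompt reprocess could take, using the context sizes and prompt-processing speeds the developer reports from testing. It assumes the context window is completely full and that nothing is reused from the cache, so it is an upper-bound illustration rather than a measurement; the function name is hypothetical.

```python
def reprocess_seconds(context_tokens: int, prompt_tokens_per_second: float) -> float:
    """Time to re-ingest a full prompt, assuming no part of the KV cache is reused."""
    return context_tokens / prompt_tokens_per_second


# Figures reported in the developer's testing: 175k-200k tokens of context,
# 200-300 tokens per second for prompt processing.
best_case = reprocess_seconds(175_000, 300)   # smallest context, fastest processing
worst_case = reprocess_seconds(200_000, 200)  # largest context, slowest processing

print(f"{best_case / 60:.1f} to {worst_case / 60:.1f} minutes per full reprocess")
```

Under those assumptions, each sub-agent hand-off that forced a full reprocess would cost on the order of ten minutes or more.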
The author, a developer who has access to GPT-5.4 and Sonnet at work but prefers to run models locally at home, built the fork using Qwen3.6-35B-A3B after finding that existing sub-agent implementations assume either multiple concurrent model instances or abundant VRAM. The repository on GitHub is designed specifically for users running the pi coding agent as their harness alongside a llama.cpp server with a single slot and 10GB of VRAM. The code is intentionally narrow in scope: it solves a specific problem for local-first practitioners who can't afford multi-instance setups, and in doing so fills a gap the author did not find addressed elsewhere in the sub-agent ecosystem.
In the developer's testing, the Apex Qwen MTP variant handled context windows of 175k–200k tokens with 8-bit (q8) KV cache quantization, processing prompts at 200–300 tokens per second and generating at 25–40 tokens per second, depending on draft hit rates. The developer plans to extend the fork with context saving via llama.cpp's --slot-save-path option and /slots endpoint, though the resulting .bin files are large. The author is also interested in hearing how other local practitioners have worked around similar sub-agent constraints.
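The planned context-saving approach could look roughly like the following. This is a minimal sketch, not the fork's code: it assumes a llama.cpp server started with --slot-save-path and reachable at a local address, and it uses the server's documented slot save and restore actions. The base URL, slot id, filename, and helper names are all illustrative assumptions.

```python
import json
import urllib.request


def build_slot_request(base_url: str, slot_id: int, action: str,
                       filename: str) -> urllib.request.Request:
    """Build a POST request that saves or restores one llama.cpp server slot."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    # The filename is resolved by the server relative to its --slot-save-path.
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(base_url: str, slot_id: int, action: str, filename: str,
                timeout: float = 120.0) -> dict:
    """Send the request and return the server's JSON reply as a dict."""
    request = build_slot_request(base_url, slot_id, action, filename)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical flow around a sub-agent run (address and filename are assumptions):
#   slot_action("http://127.0.0.1:8080", 0, "save", "main_agent.bin")
#   ... the sub-agent uses the single slot ...
#   slot_action("http://127.0.0.1:8080", 0, "restore", "main_agent.bin")
```

Saving before the sub-agent runs and restoring afterwards trades disk space for prompt-processing time, which is why the size of the .bin files matters.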
