llama.cpp KV cache overflow forces 40k-token reprocessing in long coding sessions
Users running llama.cpp with opencode and pi.dev report sudden context-checkpoint rollbacks that force multi-minute prefills, even with cache reuse and checkpointing configured and logged prompt similarity above 0.99.

A developer running long-context coding agents with llama.cpp, opencode, and pi.dev is seeing sudden, massive prompt reprocessing events that push time-to-first-token into multiple minutes. The issue surfaces when context grows past 50,000 tokens: despite log-confirmed prompt similarity above 0.99, the KV cache occasionally rolls back to around 4,750 tokens and reprocesses the entire 40,000+ token gap. One logged event showed 222 seconds to eval 44,016 tokens, compared to 473 milliseconds for 19 tokens during normal reuse.
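The arithmetic checks out: 44,016 tokens in 222 seconds is roughly 198 tokens per second of prefill, so any rollback that discards 40,000+ tokens of cache costs minutes on this hardware no matter how similar the prompt is.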
The setup runs llama-server with a 150,000-token context window, 32 context checkpoints, a 2,500 MiB cache-ram limit, cache-reuse set to 256, and both KV unload and context-shift disabled. Cache state logs show 4,676 MiB in use against the 2,500 MiB limit, suggesting the cache is overflowing and triggering eviction. The config also sets --parallel 1 to rule out multi-request interference.
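For scale, per-token KV size is roughly 2 (for K and V) * n_layers * n_kv_heads * head_dim * bytes per element. A minimal sketch, with placeholder architecture numbers chosen for illustration rather than taken from the report:

```python
# Rough KV-cache sizing sketch. The architecture numbers below are
# illustrative placeholders, not taken from the report; substitute the
# values from your model's metadata.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # 2x for the separate K and V tensors; fp16 storage = 2 bytes/elem
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical mid-size GQA model: 48 layers, 8 KV heads, 128-dim heads
per_tok = kv_bytes_per_token(48, 8, 128)
for n_ctx in (4_750, 50_000, 150_000):
    print(f"{n_ctx:>7} tokens -> {per_tok * n_ctx / 2**20:,.0f} MiB")
```

Whatever the actual model, per-token costs in the tens to hundreds of KiB make 2,500 MiB a small budget against a 150k-token window.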
Cache pressure and checkpoint eviction
The 2,500 MiB cache-ram setting appears undersized for the 150k context window. At 4,676 MiB actual usage, the cache is running nearly double its configured limit, which likely forces llama.cpp to discard older checkpoints and fall back to the earliest valid state—around 4,750 tokens in this case. The n_past value suddenly dropping from 50k+ to sub-5k suggests a checkpoint eviction rather than a prompt mismatch, since similarity remains above 0.99.
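A lightweight way to confirm the eviction theory is to watch n_past across requests. A sketch, assuming the server log exposes lines containing "n_past = <N>" (the exact format varies between llama.cpp versions, so adjust the regex to your build):

```python
import re
import sys

# Rollback detector sketch for llama-server logs. Exact log format
# varies across llama.cpp versions; this assumes lines that contain
# "n_past = <N>" somewhere, so adjust the regex to match your build.
N_PAST = re.compile(r"n_past\s*=\s*(\d+)")

prev = None
for line in sys.stdin:
    m = N_PAST.search(line)
    if m is None:
        continue
    n_past = int(m.group(1))
    # A drop of tens of thousands of tokens is a checkpoint rollback,
    # not ordinary turn-by-turn reuse, which only trims the tail.
    if prev is not None and prev - n_past > 10_000:
        print(f"rollback: n_past {prev} -> {n_past} "
              f"({prev - n_past} tokens to re-prefill)")
    prev = n_past
```

Piping the llama-server log through this while the agent runs should show the drop at the moment of the multi-minute prefill.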
The underlying cause could be cache invalidation logic, poor KV reuse heuristics, or the coding agent modifying early prompt tokens too frequently. If opencode rewrites system instructions or imports on each request, it would invalidate the prefix and force a full reprocess. Practitioners running similar long-context coding workflows with llama.cpp may want to raise --cache-ram to at least match actual usage, reduce --ctx-checkpoints to prioritize fewer but larger snapshots, or audit whether the agent framework is mutating the prompt prefix unnecessarily.
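That last check is the easiest to automate. A sketch of a prefix-stability audit, assuming you can capture the raw prompt text of consecutive requests (for example with a logging proxy in front of llama-server, a setup detail not covered in the report); it compares characters, which is a close proxy for token-level prefix reuse:

```python
# Prefix-stability audit sketch. Capturing the raw prompts is left to
# you (e.g. a logging proxy in front of llama-server); this only
# compares consecutive prompts for a shared prefix.
def common_prefix_len(a: str, b: str) -> int:
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def audit(prompts):
    """prompts: consecutive raw prompt strings, oldest first."""
    prev = None
    for i, cur in enumerate(prompts):
        if prev is not None:
            lcp = common_prefix_len(prev, cur)
            # A short shared prefix means the agent rewrote early
            # tokens, so the KV prefix cache cannot be reused past
            # that point regardless of overall similarity.
            print(f"request {i}: shared prefix {lcp}/{len(prev)} chars")
        prev = cur
```

If the shared prefix regularly collapses to near zero, the rollbacks originate in the agent framework's prompt construction rather than in llama-server's cache settings.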