DeepSeek V4 holds 250k tokens reliably, degrades past 300k in production code tests
A developer stress-tested DeepSeek V4's million-token context window across three real codebases and found precision drops sharply beyond 300k tokens, with optimal performance in the 150-250k range.

DeepSeek V4's million-token context window holds up well through mid-sized codebases but starts losing precision past the 300k mark, according to production tests run this week across three real-world repositories. A developer pushed the model through dependency tracing, cross-file refactors, and bug isolation tasks on codebases ranging from 45k to 520k tokens to see where recall breaks down.
Up to 180k tokens, the model delivered solid results: at 45k it traced function calls across eight files without error, and at 180k it handled refactors spanning 14 files with consistent architectural understanding and no contradictions. Past 300k tokens, precision degraded noticeably. Asked for exact line numbers of functions defined 400k tokens earlier, the model responded with approximations like "around line 230" instead of the actual line 247. At 520k tokens, outputs shifted to high-level architectural summaries that skipped implementation details, a problem when edge cases matter.
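One way to catch the "around line 230" class of error is to check every line number the model cites against the source itself. The sketch below is illustrative only, not the tester's tooling: it assumes the repository is Python, and the names `actual_def_line`, `check_line_claim`, and `parse_config` are hypothetical.

```python
import ast
from typing import Optional


def actual_def_line(source: str, func_name: str) -> Optional[int]:
    """Return the 1-based line where func_name is defined, or None if absent."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == func_name:
                return node.lineno
    return None


def check_line_claim(source: str, func_name: str, claimed_line: int) -> dict:
    """Compare a line number cited by the model with the real definition line."""
    actual = actual_def_line(source, func_name)
    return {
        "function": func_name,
        "claimed": claimed_line,
        "actual": actual,
        # A claim only passes when the function exists and the line is exact.
        "ok": actual is not None and actual == claimed_line,
    }


if __name__ == "__main__":
    # Hypothetical file whose function sits on line 247, as in the test above.
    fake_source = "\n" * 246 + "def parse_config():\n    pass\n"
    print(check_line_claim(fake_source, "parse_config", 230))
```

An exact-match rule is deliberately strict here; a looser tolerance would hide the kind of drift the tests describe.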
Latency also climbs in extended contexts. Time to first token measured around 1.19 seconds on DeepInfra's fp4 endpoint, but in max reasoning mode the time to first answer stretched to 120 seconds, because the model completes its internal chain of thought before producing visible output. Separately, the tester noted a 94 percent hallucination rate on unknown-answer tasks, with the model generating confident responses about nonexistent utility functions or phantom dependencies when it lacked actual grounding.
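Phantom utility functions of this kind can often be flagged mechanically by comparing the names a response calls against the names the codebase actually defines. What follows is a rough sketch under stated assumptions: the codebase is Python, the response writes calls as `name(`, and `defined_names` and `phantom_calls` are hypothetical helpers rather than part of any DeepSeek or DeepInfra API.

```python
import ast
import builtins
import re
from typing import Dict, Set

# Matches identifiers written as a call, e.g. "load_config(".
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\(")


def defined_names(sources: Dict[str, str]) -> Set[str]:
    """Collect every function and class name defined in the given files."""
    names: Set[str] = set()
    for text in sources.values():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                names.add(node.name)
    return names


def phantom_calls(answer: str, sources: Dict[str, str]) -> Set[str]:
    """Return call-like names in the answer that the codebase never defines."""
    known = defined_names(sources) | set(dir(builtins))
    mentioned = set(CALL_PATTERN.findall(answer))
    return mentioned - known


if __name__ == "__main__":
    repo = {"util.py": "def load_config(path):\n    return path\n"}
    reply = "Call load_config(path), then pass it to normalize_paths(cfg)."
    print(phantom_calls(reply, repo))
```

This check ignores imported third-party functions and method calls on objects, so a flagged name is a prompt for manual review, not proof of a hallucination.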
The practical sweet spot appears to be 150-250k tokens, where full context retention, sub-two-second response latency, and minimal precision loss align. The million-token window functions technically, but it shifts the burden from context limits to prompt engineering and validation layers. Whether future releases can flatten the precision curve past 300k, or whether that range will stay the domain of defensive prompting and manual source verification, remains to be seen.
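For teams that want to stay inside that range, one simple approach is to budget context before sending it. The sketch below packs files in priority order until a 250k-token ceiling is reached. It rests on loud assumptions: token counts are estimated at roughly four characters per token, not taken from DeepSeek's tokenizer, and `estimate_tokens` and `pack_context` are hypothetical names.

```python
from typing import List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumes about four characters per token."""
    return int(len(text) / chars_per_token) + 1


def pack_context(
    files: List[Tuple[str, str]], budget: int = 250_000
) -> Tuple[List[str], List[str], int]:
    """Greedily include (path, text) pairs, in the order given, within budget.

    Returns the included paths, the skipped paths, and the estimated tokens used.
    """
    included: List[str] = []
    skipped: List[str] = []
    used = 0
    for path, text in files:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            included.append(path)
            used += cost
        else:
            skipped.append(path)
    return included, skipped, used


if __name__ == "__main__":
    # Files are listed most relevant first; sizes are synthetic.
    repo = [
        ("core.py", "x" * 400_000),
        ("vendor_bundle.py", "y" * 800_000),
        ("helpers.py", "z" * 100_000),
    ]
    print(pack_context(repo))
```

Ordering the input by relevance matters more than the packing itself, since the greedy pass simply skips whatever no longer fits.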