OutputGuard maps seven JSON failure modes across 288 local LLM calls
A developer tested structured output across Llama, Mistral, Qwen, DeepSeek, and Command R, finding identical failure modes in open and closed models—markdown fences, trailing commas, Python literals, truncation—then built a repair library with 15 ordered strategies.
A developer running structured output prompts through 288 model calls on OpenRouter has catalogued the ways local and closed-source models break JSON, finding that open-weight models fail in the same ways as API-only systems—just at different rates. The study covered Llama 3, Mistral, Command R, DeepSeek, Qwen, and other models available through the OpenRouter API.
The seven most common failure modes, in descending order of frequency: markdown code fences wrapping the JSON (the model attempting to be helpful), trailing commas likely inherited from JavaScript training data, Python literals (True/False/None) in place of JSON booleans and null, objects truncated by token limits, unescaped quotes inside string values, inline comments (// or #), and a literal ellipsis (...) where the model declined to generate complete data. Rates vary, with some models fencing nearly every response and others slipping only under specific prompt phrasing, but the categories remain consistent across vendors.
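The seven categories can be reproduced with hypothetical model outputs; every one of these strings is rejected by Python's standard json parser:

```python
import json

# Illustrative samples of the seven failure modes above (invented outputs,
# not the article's actual test data) -- each raises json.JSONDecodeError.
samples = {
    "markdown fence":  '```json\n{"ok": true}\n```',
    "trailing comma":  '{"items": [1, 2,]}',
    "python literals": '{"flag": True, "value": None}',
    "truncation":      '{"text": "cut off here',
    "unescaped quote": '{"quote": "she said "hi""}',
    "inline comment":  '{"n": 1} // count',
    "ellipsis":        '{"rows": [...]}',
}

for name, raw in samples.items():
    try:
        json.loads(raw)
    except json.JSONDecodeError as e:
        print(f"{name}: {e.msg}")
```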
Repair strategies and ordering
The developer built outputguard, a Python library that validates against JSON Schema and, when parsing fails, runs 15 repair strategies in a fixed sequence. The ordering matters: encoding fixes run before structural repairs, and the parser re-runs between strategies so that later fixes cannot undo earlier ones. The library also handles YAML, TOML, and Python literal syntax, formats that showed up more often than expected once models without a reliable JSON mode were left to answer however they pleased.
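The ordered-repair loop can be sketched as follows. This is an illustration of the idea, not outputguard's actual code, and it uses three toy strategies instead of fifteen; the key point is that the parser re-runs before each strategy, so a later fix never touches text an earlier one already made valid:

```python
import json
import re

def strip_fences(text: str) -> str:
    # Remove a leading ```json (or bare ```) fence and a trailing ``` fence.
    return re.sub(r"^```\w*\s*|\s*```$", "", text.strip())

def fix_python_literals(text: str) -> str:
    # Naive word-boundary replacement; assumes the literals do not appear
    # inside string values.
    for src, dst in (("True", "true"), ("False", "false"), ("None", "null")):
        text = re.sub(rf"\b{src}\b", dst, text)
    return text

def drop_trailing_commas(text: str) -> str:
    # Delete a comma that directly precedes a closing brace or bracket.
    return re.sub(r",\s*([}\]])", r"\1", text)

STRATEGIES = [strip_fences, fix_python_literals, drop_trailing_commas]

def repair_and_parse(text: str):
    for strategy in STRATEGIES:
        try:
            return json.loads(text)   # re-run the parser between strategies
        except json.JSONDecodeError:
            text = strategy(text)     # apply the next repair and retry
    return json.loads(text)           # final attempt; raises if still broken

print(repair_and_parse('```json\n{"ok": True,}\n```'))  # {'ok': True}
```

Re-parsing between strategies also means a response that is already valid JSON returns immediately, untouched by any repair.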
JSON mode and constrained grammars help when available, but many local setups lack a reliable JSON mode, and grammar-based generation carries speed and compatibility tradeoffs; schema violations and truncation can still occur even when the syntax is valid. OutputGuard passes 2,001 tests, ships under an MIT license, and has no dependencies on LLM providers. Installation: pip install outputguard. Full findings, including the prompt phrasing that triggered specific failures, are documented in an accompanying write-up.
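The "valid syntax, invalid schema" case is worth seeing concretely. A hand-rolled required-fields check stands in here for full JSON Schema validation (outputguard's own API is not shown); the required field names are made up for the example:

```python
import json

# Hypothetical expected shape: each field name mapped to its required type.
REQUIRED = {"name": str, "age": int}

def check(doc: dict) -> list[str]:
    """Return a list of schema violations (empty list means the doc passes)."""
    errors = []
    for key, typ in REQUIRED.items():
        if key not in doc:
            errors.append(f"missing required field: {key}")
        elif not isinstance(doc[key], typ):
            errors.append(f"{key}: expected {typ.__name__}")
    return errors

# This parses without error, so grammar-constrained decoding would accept
# it, yet the age field has the wrong type.
doc = json.loads('{"name": "Ada", "age": "36"}')
print(check(doc))  # ['age: expected int']
```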
