EvalShift CLI catches silent quantization breakage in tool calls and JSON output
New open-source regression-testing tool compares LLM behavior across quantization levels, catching subtle changes in structured output and function calling that break downstream code.

EvalShift, an MIT-licensed command-line tool for regression-testing LLM changes, is expanding to quantization testing after community feedback revealed a gap in local model validation. The tool runs the same prompt suite against two versions of a model (say, Q8 and Q4_K_M quantizations of the same base weights), then generates an HTML report showing where behavior diverged.
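The comparison loop described above can be sketched in a few lines of Python. This is an illustration of the general idea, not EvalShift's actual code or API: `run_suite`, `fake_q8`, and `fake_q4` are hypothetical names, and the two stand-in functions take the place of real calls to an inference backend.

```python
from typing import Callable, Dict, List

# A "model" here is any function that maps a prompt to a text response.
# In practice this would wrap a call to a local inference backend (assumption).
Model = Callable[[str], str]


def run_suite(prompts: List[str], baseline: Model, candidate: Model) -> List[Dict[str, str]]:
    """Run every prompt against both models and keep the pairs that differ."""
    diverged = []
    for prompt in prompts:
        base_out = baseline(prompt)
        cand_out = candidate(prompt)
        # Exact string comparison after trimming whitespace; a real tool
        # would likely compare structure rather than raw text.
        if base_out.strip() != cand_out.strip():
            diverged.append(
                {"prompt": prompt, "baseline": base_out, "candidate": cand_out}
            )
    return diverged


# Hypothetical stand-ins for a Q8 and a Q4_K_M build of the same model.
def fake_q8(prompt: str) -> str:
    return '{"city": "Paris"}' if "capital" in prompt else "ok"


def fake_q4(prompt: str) -> str:
    # Simulates a lower quant that drops the closing brace.
    return '{"city": "Paris"' if "capital" in prompt else "ok"


report = run_suite(["capital of France as JSON", "say ok"], fake_q8, fake_q4)
print(len(report))  # 1
```

Only the first prompt shows up in `report`, because the second produces identical output from both stand-ins. Exact string comparison is the simplest possible divergence check and would be too noisy for free-form text, which is why structured checks matter.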
The developer is seeking input on which model and quant pair to test first, which inference backend to prioritize (Ollama, llama.cpp, or vLLM), and which failure mode matters most to practitioners. The tool is local-first: it needs no server backend, no accounts, and sends no telemetry.
What the tool detects
- Invalid JSON and structured output: catches when a lower quant stops emitting parseable data, a common silent failure in production pipelines that rely on strict schemas.
- Changed tool or function selection: detects when a model switches which function it calls under quantization, even if the output looks reasonable in isolation.
- Mutated tool arguments: flags when function parameters shift (wrong types, missing keys, hallucinated values) between quant levels, breaking API contracts.
- Skipped retrieval or instruction steps: identifies when a quantized model drops a retrieval call or ignores part of a multi-step instruction that the higher-precision version followed.
- Plausible output that breaks code: surfaces responses that read fine to a human but fail schema validation, type checks, or downstream logic. This is the hardest class of regression to catch without automated comparison.
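As a rough illustration of the first three checks, the sketch below classifies the difference between two tool-call responses. It is not taken from EvalShift. The `{"name": ..., "arguments": {...}}` shape is an assumed tool-call format, and `parse_call`, `classify`, and the finding labels are hypothetical names chosen for this example.

```python
import json
from typing import Any, Dict, List, Optional


def parse_call(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed tool call, or None if the text is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def classify(baseline_raw: str, candidate_raw: str) -> List[str]:
    """Label how the candidate's tool call differs from the baseline's."""
    base = parse_call(baseline_raw)
    cand = parse_call(candidate_raw)
    if base is None:
        return ["baseline_invalid"]  # nothing trustworthy to compare against
    if cand is None:
        return ["invalid_json"]
    if base.get("name") != cand.get("name"):
        return ["tool_changed"]  # arguments are not comparable across tools

    base_args = base.get("arguments") or {}
    cand_args = cand.get("arguments") or {}
    if not isinstance(base_args, dict) or not isinstance(cand_args, dict):
        return ["arguments_not_object"]

    findings = []
    for key in sorted(set(base_args) | set(cand_args)):
        if key not in cand_args:
            findings.append(f"missing_arg:{key}")
        elif key not in base_args:
            findings.append(f"extra_arg:{key}")
        elif type(base_args[key]) is not type(cand_args[key]):
            findings.append(f"type_changed:{key}")
        elif base_args[key] != cand_args[key]:
            findings.append(f"value_changed:{key}")
    return findings


q8 = '{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}'
q4 = '{"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}'
print(classify(q8, q4))  # ['type_changed:days']
```

The example pair is the "plausible output that breaks code" case: `"3"` reads fine to a human, but a downstream function expecting an integer would fail. Detecting skipped retrieval or instruction steps would require comparing whole call sequences, which this sketch does not attempt.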