UCCI cuts LLM cascade inference cost 31% with calibrated uncertainty routing
New arXiv preprint reports that an isotonic-regression router beats entropy-thresholding and FrugalGPT-style baselines on a 75,000-query NER workload, with cost measured from real H100 latency.
Cascade routers that send easy queries to small models and escalate hard ones to large models promise lower inference cost, but most deployed systems rely on raw confidence scores and require per-workload threshold tuning. A new preprint proposes treating confidence calibration as a first-class problem instead.
UCCI (Uncertainty-Calibrated Cascade Inference) maps token-level margin uncertainty to a per-query error probability via isotonic regression, then picks the escalation threshold by constrained cost minimization. Researchers tested it on a production named-entity-recognition workload of 75,000 queries served by 4B and 12B instruction-tuned models on H100 GPUs. The system cut inference cost by 31 percent (95% CI: 27–35%) at micro-F1 = 0.91 while shrinking expected calibration error from 0.12 to 0.03. At the same operating point, UCCI beat entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold.
The method converts the small model's margin, the difference between its top two logits, into a probability that the small model's answer is wrong. The router then solves a constrained optimization: minimize expected cost subject to an accuracy floor, using the calibrated error probabilities as weights. Under three explicit assumptions (independent errors, Lipschitz cost functions, and bounded query distributions), the preprint shows that threshold policies on the calibrated score are cost-optimal. With n calibration samples, the expected calibration error of the isotonic fit shrinks at a rate of O(n^{-1/3}).
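The two steps described above, calibrating a score and then picking a threshold under a quality constraint, can be sketched in a few lines of Python. This is a minimal illustration, not the preprint's implementation. The function names (fit_isotonic, predict, choose_threshold), the use of the negated margin as an uncertainty score, the plain error-rate constraint standing in for the micro-F1 floor, and the fixed large-model error rate are all assumptions made for the example.

```python
from bisect import bisect_left


def fit_isotonic(scores, errors):
    """Pool-adjacent-violators fit of a non-decreasing map: score -> P(error).

    scores: uncertainty per query (assumed here: the negated top-two margin,
            so a larger value means the small model is less sure).
    errors: 1 if the small model was wrong on that query, else 0.
    Returns (uppers, means): a step function with one entry per pooled block.
    """
    blocks = []  # each block is [sum_of_errors, count, max_score_in_block]
    for s, e in sorted(zip(scores, errors)):
        if blocks and blocks[-1][2] == s:
            # Tied scores must share one fitted value.
            blocks[-1][0] += e
            blocks[-1][1] += 1
        else:
            blocks.append([float(e), 1, s])
        # Merge backwards while the fitted means would decrease.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict(model, score):
    """Calibrated error probability for one score (clamped at the ends)."""
    uppers, means = model
    i = bisect_left(uppers, score)
    return means[min(i, len(means) - 1)]


def choose_threshold(p_err, large_err, cost_small, cost_large, max_err):
    """Cheapest policy 'escalate when p_err > t' with expected error <= max_err.

    Assumes every query pays cost_small, escalated queries also pay
    cost_large, and the large model has a fixed error rate large_err.
    Returns (threshold, expected_cost, expected_error), or None if no
    threshold satisfies the constraint.
    """
    n = len(p_err)
    best = None
    # t = -1.0 escalates everything; the other candidates are observed values.
    for t in [-1.0] + sorted(set(p_err)):
        kept = [p for p in p_err if p <= t]
        n_escalated = n - len(kept)
        expected_err = (sum(kept) + large_err * n_escalated) / n
        expected_cost = cost_small + cost_large * n_escalated / n
        if expected_err <= max_err and (best is None or expected_cost < best[1]):
            best = (t, expected_cost, expected_err)
    return best
```

In use, fit_isotonic would be run once on a held-out calibration set, predict would be applied to each incoming query's score, and choose_threshold would be run on the calibrated probabilities of a representative sample to fix the escalation cutoff. The linear scan over candidate thresholds is chosen for readability; a sorted prefix sum would do the same job faster on large samples.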
All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices. The preprint was posted to arXiv on May 20, 2026.
