Verifiable Transformers encode circuit explanations as formal solver-checkable claims
Researchers convert task-localized Transformer circuits into bounded SMT propositions, exhaustively verifying quote-closing and bracket-tracking circuits on small models and demonstrating stable GPT-2 scale training with SMT-encodable operators.

A preprint posted to arXiv this week introduces Verifiable Transformers, a framework that converts task-localized Transformer circuits into formal propositions a satisfiability-modulo-theories solver can check. Mechanistic interpretability typically identifies circuits inside Transformers through examples, ablations, and manual reasoning—but stops short of proof. The new framework closes that gap by encoding extracted circuits directly into an SMT solver, then verifying properties like projected functional equivalence, edge necessity, and final-residual robustness over a finite task domain.
The authors instantiate direct verification with a GPT-style architecture built from SMT-encodable primitives: Signed L1 BandNorm, sparsemax attention, and LeakyReLU activations. On small symbolic sequence tasks—quote closing and bracket type tracking—they train a Transformer whose weights and forward pass fit entirely inside the solver's representation. They extract sparse circuits for each task and exhaustively verify that the circuit produces the correct output for every possible input in the bounded domain, that each edge is necessary, and that task-irrelevant content changes leave predictions unchanged. The solver either confirms the claim or returns a counterexample.
Scaling and hard-to-encode operators
The same operator stack trains stably on OpenWebText at GPT-2 parameter count, though naive direct SMT verification remains intractable at that scale. For circuits that include operators the solver cannot encode exactly—standard softmax attention, for instance—the framework offers surrogate-mediated verification: fit an SMT-encodable surrogate to the extracted circuit, validate the surrogate against the real circuit over the bounded task domain, then verify symbolic explanations against the surrogate. The paper demonstrates both verified symbolic explanations and solver-generated counterexamples on task-localized circuits with hard-to-encode attention.
The authors are not attempting full-model verification. The goal is a concrete path from "this circuit does X" to a formal claim a solver can prove or refute, turning mechanistic interpretability findings into propositions with the same rigor as a type-checked proof.

