Manifestation Units protocol makes neural network analyses queryable and reusable
New arXiv preprint introduces typed tuple schema that converts per-study mechanistic interpretability findings into structured, composable records across vision and language models.

Mechanistic interpretability research has built a detailed catalog of how neural network components encode information and interact, but those findings typically stay trapped in individual study notebooks as selectivity tables, circuit diagrams, and feature lists that can't be queried or reused.
A preprint posted to arXiv on July 2 introduces Manifestation Units, a typed tuple protocol designed to bridge that gap by organizing per-component statistics into structured fields that downstream tools can consume. The schema defines six fields: E (entity), S (selectivity), R (role), D (distribution), G (geometry), and T (transformer-specific attention-head primitives). The authors tested the protocol across three architectures (a beta-VAE for generative vision, a CNN for discriminative vision, and GPT-2 for language) and found that typed structure substantially outperformed unstructured baselines on retrieval tasks.
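To make the six-field schema concrete, here is a minimal Python sketch of what one such record might look like. Only the field letters and their names come from the preprint as described above. The Python types, the class name `ManifestationUnit`, and the example values are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, NamedTuple, Optional, Sequence


class ManifestationUnit(NamedTuple):
    """One record per model component. All field types are assumptions."""

    E: str                  # entity: hypothetical component ID
    S: Dict[str, float]     # selectivity: assumed concept -> score map
    R: str                  # role: assumed short label
    D: Sequence[float]      # distribution: assumed summary statistics
    G: Sequence[float]      # geometry: assumed direction vector
    T: Optional[Dict[str, float]] = None  # attention-head primitives, transformers only


# A made-up CNN filter record; T stays None because a CNN has no attention heads.
unit = ManifestationUnit(
    E="cnn.conv3.filter17",
    S={"stripes": 0.82},
    R="texture detector",
    D=[0.1, 0.4],
    G=[0.0, 1.0],
)
```

Making T optional in this sketch mirrors the article's point that the attention-head field can be added for transformers without altering the other five fields.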
Validation across architectures
CNN filters retrieved using the schema satisfied causal sufficiency and necessity criteria under matched-budget controls, meaning the system could identify components that actually matter for a given behavior. When applied to GPT-2, the protocol recovered known members of the Indirect Object Identification (IOI) circuit under retrieval-budget-matched controls, indicating that it can surface components already documented in the literature. The authors also identified an irreducible two-field core, selectivity (S) plus role (R), with the remaining fields either redundant or actively interfering with retrieval quality. The T field for attention heads integrated without requiring changes to the base schema.
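The idea of querying on the S and R fields and comparing against a budget-matched control can be illustrated with a toy Python sketch. This is not the authors' retrieval procedure: the `Unit` record, the exact-match role filter, the score ranking, the random control, and the sample data are all hypothetical, chosen only to show what "same budget" means in practice.

```python
import random
from typing import Dict, List, NamedTuple


class Unit(NamedTuple):
    E: str               # entity ID (hypothetical naming)
    S: Dict[str, float]  # selectivity scores per concept (assumed representation)
    R: str               # role label (assumed to be a short string)


def retrieve(units: List[Unit], concept: str, role: str, budget: int) -> List[str]:
    """Return up to `budget` entity IDs with the given role, ranked by selectivity."""
    matches = [u for u in units if u.R == role and concept in u.S]
    matches.sort(key=lambda u: u.S[concept], reverse=True)
    return [u.E for u in matches[:budget]]


def random_control(units: List[Unit], budget: int, seed: int = 0) -> List[str]:
    """Budget-matched control: the same number of components, drawn at random."""
    rng = random.Random(seed)
    return [u.E for u in rng.sample(units, min(budget, len(units)))]


# Made-up records for illustration only.
units = [
    Unit("head.0", {"name": 0.9}, "mover"),
    Unit("head.1", {"name": 0.4}, "mover"),
    Unit("head.2", {"name": 0.95}, "inhibitor"),
    Unit("head.3", {"verb": 0.7}, "mover"),
]
top = retrieve(units, concept="name", role="mover", budget=2)
control = random_control(units, budget=len(top))
```

In a setup like this, any causal test (for example, ablating the selected components) would be run on both `top` and `control`, so that differences reflect which components were chosen and not how many.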
The paper frames this as infrastructure work rather than frontier-scale validation. By making mechanistic interpretability findings composable and queryable in natural language, the protocol turns per-study outputs into records that audit and intervention tools can consume directly. The authors instantiated it automatically across the three test architectures, suggesting it could scale to larger model families without manual annotation per component.



