Steered LLM activations cannot be reproduced by any text prompt, formal proof shows
A new preprint establishes that white-box activation steering pushes language models into behavioral states unreachable by discrete prompts, separating interpretability claims from real-world vulnerability.

Activation steering—the technique of nudging a language model's internal states to change its behavior—operates in a fundamentally different space than text prompts, according to a formal proof from researchers at Johns Hopkins and the Allen Institute for AI.
The preprint, published on arXiv on May 18, 2026, shows that almost surely no prompt can reproduce the same internal behavior induced by steering. The authors cast the question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, the answer is no. Steering pushes the residual stream off the manifold of states reachable from discrete text inputs.
The researchers validated the finding empirically across three widely used LLMs. In each case, steered activations fell outside the distribution of states produced by natural prompts. The gap held across steering magnitudes and model scales.
This result formalizes what many practitioners have suspected: that white-box control techniques can elicit behaviors that no discrete prompt could ever trigger. A model may be trivially steerable via direct activation edits while remaining robust to adversarial prompts. The authors recommend evaluation protocols that explicitly separate white-box interventions—where an attacker has full access to weights and activations—from black-box prompting scenarios.
The work has immediate implications for interpretability research that uses steering to probe model internals. If steered states are unreachable by prompts, then conclusions drawn from steering experiments may not generalize to real-world prompt-based interactions. The same caution applies to safety work that uses steering to test jailbreakability: success under white-box control does not imply prompt-based risk.