Safety probes miss jailbreaks hiding in earlier prompt tokens, study finds
New research shows that single-token safety monitors fail to catch adversarial prompts whose unsafe signals are spread across earlier hidden states, a blind spot that simple trajectory models can partially close.
Safety probes that read a language model's final hidden state after prompt processing miss jailbreak attacks whose harmful signals appear earlier in the input sequence, according to a preprint released this week on arXiv. The finding challenges the widespread practice of monitoring only the last-token representation before generation begins.
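In the setup the paper critiques, "reading the final hidden state" amounts to a single prefill pass over the prompt followed by a readout at the last position. Here is a minimal sketch of that readout, assuming a Hugging Face causal LM; the model name and API usage are illustrative, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I safely dispose of hazardous materials?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (batch, seq_len, d_model) tensors, one per
# layer (plus embeddings); a final-token probe reads only position -1
# of the last layer.
final_hidden = out.hidden_states[-1][0, -1, :]   # shape: (d_model,)
```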
Researchers tested SafeSwitch-style probes, linear classifiers trained to flag harmful prompts, across three instruction-tuned models. While the probes achieved high recall on straightforward harmful requests, they missed many jailbreak variants and produced false positives on safety-adjacent benign prompts such as "How do I safely dispose of hazardous materials?" Subspace analysis revealed that the missed jailbreaks occupy representational directions poorly captured by the probe's learned subspace, and simply widening the probe's bottleneck did not fix the mismatch.
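The probe itself is typically just a linear head over that final-token vector. The sketch below uses synthetic hidden states and scikit-learn to make the mechanics concrete; the dimensions, the injected "unsafe direction", and the helper name flag_prompt are all hypothetical, not the paper's implementation:

```python
# Minimal sketch of a SafeSwitch-style final-token probe: a linear
# classifier over the last hidden state of the prompt. Hidden states
# here are synthetic stand-ins; in practice they would come from the
# prefill pass shown above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512                       # hidden size (assumed)
n_train = 2000

# Synthetic final-token states: harmful prompts shifted along one
# "unsafe" direction, benign prompts centered at the origin.
unsafe_dir = rng.normal(size=d_model)
unsafe_dir /= np.linalg.norm(unsafe_dir)
labels = rng.integers(0, 2, size=n_train)         # 1 = harmful
feats = rng.normal(size=(n_train, d_model))
feats += 2.0 * labels[:, None] * unsafe_dir       # inject class signal

probe = LogisticRegression(max_iter=1000).fit(feats, labels)

def flag_prompt(final_hidden: np.ndarray, threshold: float = 0.5) -> bool:
    """Return True if the probe flags the prompt as harmful."""
    p_harmful = probe.predict_proba(final_hidden[None, :])[0, 1]
    return p_harmful >= threshold
```

The paper's subspace finding suggests the failure mode of such a probe: whatever directions the linear head learns, some jailbreaks simply live outside them.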
Token-level prefill analysis showed that unsafe evidence often surfaces in earlier user-token representations but vanishes by the final readout. Naive max-pooling over all token positions recovered some of the misses but fired too readily on safe prompts. The researchers instead trained a PCA-HMM trajectory model, using only the same clean harmful and benign examples, that tracked how hidden states evolve across the prefill phase. This trajectory approach caught many final-token failures without the catastrophic false-positive rate of token pooling, suggesting that monitoring the path through representation space, rather than a single endpoint, is a more robust diagnostic for adversarial inputs.
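A trajectory monitor in this spirit can be sketched with off-the-shelf pieces: PCA to compress the per-token hidden states, then one Gaussian HMM per class, scored by likelihood ratio. hmmlearn stands in here for whatever the authors actually used, and every hyperparameter and the synthetic data below are assumptions:

```python
# Illustrative PCA-HMM trajectory monitor: fit PCA on all clean
# trajectories, fit a per-class Gaussian HMM in PCA space, then score
# new prompts by a length-normalized log-likelihood ratio.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

def fit_hmm(seqs, pca, n_states=4):
    """seqs: list of (seq_len, d_model) hidden-state arrays, one class."""
    reduced = [pca.transform(s) for s in seqs]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=50, random_state=0)
    hmm.fit(np.vstack(reduced), lengths=[len(r) for r in reduced])
    return hmm

# Synthetic stand-in trajectories (real ones come from prefill passes).
rng = np.random.default_rng(0)
benign = [rng.normal(size=(rng.integers(20, 40), 512)) for _ in range(50)]
harmful = [rng.normal(size=(rng.integers(20, 40), 512)) + 0.3
           for _ in range(50)]

pca = PCA(n_components=16).fit(np.vstack(benign + harmful))
hmm_benign = fit_hmm(benign, pca)
hmm_harmful = fit_hmm(harmful, pca)

def trajectory_score(seq):
    """Positive score: the whole path looks more like harmful traffic."""
    z = pca.transform(seq)
    return (hmm_harmful.score(z) - hmm_benign.score(z)) / len(seq)
```

Unlike max-pooling, which fires if any single token looks unsafe, the HMM scores the sequence as a whole, which is plausibly why the trajectory approach avoids the false-positive blowup the paper reports for token pooling.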
The preprint, "Before the Last Token: Diagnosing Final-Token Safety Probe Failures" (arXiv:2605.12726), was posted May 14, 2026. Because the trajectory model is trained only on clean data, it works as a drop-in complement to existing final-token probes rather than a replacement that requires adversarial examples.
