Safety probes miss jailbreaks hidden in early prompt tokens, arXiv study finds
A new arXiv preprint shows that single-token readout probes fail on adversarial prompts that distribute unsafe evidence across the prefill sequence, and proposes trajectory-aware diagnostics that catch these misses without catastrophic false positives.
Safety probes that monitor a single hidden state after prompt prefill miss many jailbreak attacks, according to a preprint released May 14. The reason: adversarial prompts scatter unsafe evidence across earlier user tokens, and that evidence never surfaces at the final readout position.
Researchers trained SafeSwitch-style probes on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieved high recall on straightforward harmful requests but failed on jailbreaks and produced false positives on safety-adjacent benign prompts like "How do I safely dispose of lithium batteries?" Subspace analysis showed that missed jailbreaks differ from clean benign prompts along representational directions the probe never learned, and widening the probe's bottleneck did not reliably close the gap.
Token-level prefill analysis revealed that unsafe evidence visible to the probe often appears mid-sequence but vanishes by the time the model reaches the final token. Naive max-pooling over all token positions recovered some misses but overfired on safe prompts. The authors propose a PCA-HMM trajectory model trained only on the same clean split as the probes. By reading the full prefill trajectory, it catches many of the final-token misses while avoiding the catastrophic false-positive rate of naive pooling. The work frames trajectory-aware hidden-state diagnostics as a practical complement to final-token probes, not a replacement.
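To make the three readout strategies concrete, here is a minimal Python sketch. It is not the authors' code, which is not public. The per-token probe scores are invented for illustration, and a two-state hidden Markov model over scalar scores stands in for the preprint's PCA-HMM over hidden-state trajectories. Every function name, threshold, and HMM parameter below is a hypothetical choice, not a value from the paper.

```python
import math

# Invented per-token probe scores in [0, 1]; not data from the preprint.
JAILBREAK_LIKE = [0.1, 0.2, 0.85, 0.9, 0.8, 0.3, 0.15]  # sustained mid-sequence evidence
BENIGN_SPIKE = [0.1, 0.15, 0.7, 0.2, 0.1, 0.1, 0.1]     # one isolated spike


def final_token_flag(scores, threshold=0.5):
    """Single-token readout: look only at the last prefill position."""
    return scores[-1] >= threshold


def max_pool_flag(scores, threshold=0.5):
    """Naive pooling: fire if any token position crosses the threshold."""
    return max(scores) >= threshold


def unsafe_posteriors(scores, stay=0.9, mu_safe=0.2, mu_unsafe=0.8,
                      sigma=0.2, prior_unsafe=0.1):
    """Forward-filter a two-state HMM (safe/unsafe) with Gaussian emissions.

    Returns P(unsafe at token t | scores[0..t]) for every t.
    All parameters are hypothetical defaults chosen for this sketch.
    """
    posteriors = []
    predicted = prior_unsafe  # P(unsafe) before seeing the current token
    for x in scores:
        # Log-likelihood ratio of the unsafe vs safe emission (shared sigma).
        llr = ((x - mu_safe) ** 2 - (x - mu_unsafe) ** 2) / (2 * sigma ** 2)
        odds = predicted / (1 - predicted) * math.exp(llr)
        p = odds / (1 + odds)
        posteriors.append(p)
        # Transition step: states persist with probability `stay`.
        predicted = p * stay + (1 - p) * (1 - stay)
    return posteriors


def trajectory_flag(scores, min_run=2, threshold=0.5):
    """Fire only when the unsafe posterior stays high for min_run tokens in a row."""
    run = 0
    for p in unsafe_posteriors(scores):
        run = run + 1 if p >= threshold else 0
        if run >= min_run:
            return True
    return False


for name, scores in [("jailbreak-like", JAILBREAK_LIKE), ("benign spike", BENIGN_SPIKE)]:
    print(name,
          "final:", final_token_flag(scores),
          "max-pool:", max_pool_flag(scores),
          "trajectory:", trajectory_flag(scores))
```

On these made-up trajectories, the final-token readout misses the jailbreak-like sequence, max-pooling flags both sequences, and the run-based trajectory rule flags only the one with sustained mid-sequence evidence. That mirrors the qualitative pattern the preprint reports, but says nothing about its actual numbers.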
The preprint is available on arXiv (2605.12726); code and probe checkpoints are not yet public.
