Single-neuron suppression bypasses safety alignment across 7 LLMs up to 70B
Researchers demonstrate that suppressing individual neurons can bypass safety alignment in language models without training or prompt engineering.
A new preprint shows that safety alignment in large language models depends on individual neurons that can each be targeted to disable refusal behavior. Researchers Hamid Kazemi, Atoosa Chegini, and Maria Safi identified two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By suppressing a single neuron in each system, the team bypassed safety alignment across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. The inverse also held: amplifying concept neurons induced harmful content from benign prompts that would otherwise produce safe outputs.
The findings challenge the assumption that safety tuning distributes robustly across model weights. Instead, suppressing any one identified refusal neuron bypassed safety alignment across diverse harmful requests. For the open-weight AI ecosystem, this suggests that models released with safety alignment, whether through reinforcement learning from human feedback or supervised fine-tuning, may be more brittle than previously understood. The consistency of the single-neuron effect across the 1.7B-to-70B parameter range indicates the vulnerability is architectural rather than an artifact of model size. The preprint (arXiv 2605.08513) was posted May 12, 2026.
