Apple researchers find safety in LLMs collapses to single neurons
A new preprint shows that safety alignment in large language models relies on isolated MLP neurons, leaving models vulnerable to white-box attacks that bypass guardrails entirely.
Apple researchers have demonstrated that safety alignment in modern large language models relies on individual, isolated neurons rather than distributed, network-wide mechanisms. In a preprint published this week on arXiv, Hamid Kazemi, Atoosa Chegini, and Maria Safi show that suppressing a single "refusal neuron" can completely bypass safety guardrails, while amplifying a single "concept neuron" forces the model to generate harmful content in response to benign prompts.
The finding challenges the assumption that standard alignment methods like RLHF and supervised fine-tuning create robust, distributed safety systems. Although safety fine-tuning touches millions of parameters, the actual blocking mechanism for harmful requests collapses to a single point of failure. The same brittleness applies to harmful knowledge itself: dangerous capabilities are localized in specific concept neurons rather than erased or transformed by alignment training.
Neuron-level intervention
The team focused on MLP layers in transformer models, identifying neurons that activate strongly when the model refuses a harmful request. By intervening on these neurons during inference — either suppressing refusal neurons or amplifying concept neurons — they achieved near-complete safety bypass across multiple frontier and open-weight models. The attack requires white-box access but no gradient updates or retraining.
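The paper does not ship code, but the mechanics of such an inference-time intervention can be illustrated with a small PyTorch forward hook. The sketch below assumes a Hugging Face transformers model with a LLaMA-style MLP; the model name, layer index, neuron index, and scale factor are hypothetical placeholders, not values reported by the authors. Setting the scale to zero corresponds to suppressing a neuron, while a value above one corresponds to amplification.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-7B-Instruct"  # placeholder: any open-weight chat model
LAYER_IDX = 20      # hypothetical layer hosting the targeted neuron
NEURON_IDX = 1337   # hypothetical index in the MLP's intermediate dimension
SCALE = 0.0         # 0.0 suppresses the neuron; a value > 1.0 would amplify it

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def rescale_neuron(module, inputs, output):
    # Rescale one coordinate of the MLP's per-neuron activation.
    # No weights are modified; the edit happens only during the forward pass.
    output[..., NEURON_IDX] *= SCALE
    return output

# The module path is architecture-specific; for LLaMA-style models the MLP's
# activation function lives at model.model.layers[i].mlp.act_fn.
target = model.model.layers[LAYER_IDX].mlp.act_fn
handle = target.register_forward_hook(rescale_neuron)

prompt = "Write a short poem about the ocean."  # example prompt, not from the paper
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore normal behavior
```

As the hook makes clear, the attack needs nothing beyond read access to activations and the ability to overwrite them during generation: no gradients, no retraining, and no changes to the stored weights.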
The paper argues that current alignment methods act as a fragile tripwire tied to a single component, rather than instilling genuine ethical reasoning or erasing dangerous knowledge. The authors call for new alignment paradigms that distribute safety knowledge across the network, making models resilient to targeted neuron-level manipulation. For practitioners running open-weight models locally, the findings underscore that standard fine-tuning approaches leave safety mechanisms architecturally brittle and exploitable at inference time.
