Researchers pinpoint the Massive Emergence Layer where LLM activations spike
New paper identifies the specific layer in large language models where massive activations first appear and then propagate through residual connections, and proposes a training-free fix that improves instruction following and math reasoning.

Researchers have traced the origin of massive activations in large language models to a single layer they call the Massive Emergence Layer (ME Layer). In a preprint posted May 13, Zeru Shi, Zhenting Wang, Fan Yang, Qifan Wang, and Ruixiang Tang show that this layer appears consistently across model families, and that the RMSNorm and feed-forward network parameters inside it jointly trigger the phenomenon. Once massive activations form in that layer, they propagate to deeper layers through residual connections and remain largely unchanged, which reduces the diversity of the hidden representations fed into downstream attention modules.
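The emergence point can be probed empirically by scanning per-layer hidden states for a sudden jump in peak magnitude. The sketch below is a minimal illustration rather than the paper's procedure: the checkpoint name, the 10x jump threshold, and the reliance on HuggingFace's `output_hidden_states` are all assumptions.

```python
# Illustrative probe (not the paper's code): find the first layer whose peak
# hidden-state magnitude jumps sharply relative to the previous layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[i] has shape (batch, seq_len, hidden_dim); index 0 is the embedding output.
peaks = [h.abs().max().item() for h in out.hidden_states]
for i in range(1, len(peaks)):
    if peaks[i] > 10 * peaks[i - 1]:  # arbitrary 10x jump threshold
        print(f"Massive activations first appear at layer {i}: {peaks[i-1]:.1f} -> {peaks[i]:.1f}")
        break
```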
The team's method selectively weakens the rigidity of the tokens that carry massive activations, and the paper reports consistent gains on instruction-following and math-reasoning benchmarks in both training-free and fine-tuning settings. The approach also mitigates attention sinks by reducing their influence at the hidden-state level, offering a new angle on why attention sinks form and how to address them without architectural surgery.
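The paper's exact intervention is not reproduced here, but a training-free mitigation in the same spirit can be sketched as a forward hook that clamps outlier activations at a candidate ME Layer. The layer index, clamp factor, and LLaMA-style module path below are hypothetical choices, and `model` is assumed to be the checkpoint loaded in the previous sketch.

```python
# Illustrative training-free intervention (an assumption, not the paper's method):
# clamp outlier activation magnitudes at the output of a candidate ME Layer.
import torch

ME_LAYER_IDX = 2      # hypothetical index, e.g. the layer found by the scan above
CLAMP_FACTOR = 5.0    # keep activations within 5x the median magnitude

def dampen_massive_activations(module, inputs, output):
    # Decoder blocks may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    cap = CLAMP_FACTOR * hidden.float().abs().median().item()
    damped = hidden.clamp(min=-cap, max=cap)
    if isinstance(output, tuple):
        return (damped,) + output[1:]
    return damped

# LLaMA-style models expose their decoder blocks under model.model.layers.
handle = model.model.layers[ME_LAYER_IDX].register_forward_hook(dampen_massive_activations)
# ... run generation or benchmark evaluation here ...
handle.remove()
```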
The findings suggest that massive activations are not a distributed emergent property but a localized one, traceable to a specific point in the forward pass. That localization opens the door to targeted interventions at the ME Layer, such as pruning, rescaling, or architectural tweaks, that could improve model efficiency and output diversity without retraining from scratch. The next step is to see whether the ME Layer location holds across newer architectures such as Mamba or hybrid attention-MLP designs, and whether the proposed mitigation scales to models beyond the 70B-parameter range tested in the paper. If it does, training recipes may soon bake in ME Layer regularization by default.
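If ME Layer regularization does find its way into training recipes, one plausible form is an auxiliary loss that penalizes outsized activations at that layer. The term below is a speculative illustration of the idea, not something proposed in the paper; the layer index and weight are arbitrary.

```python
# Speculative sketch of "ME Layer regularization": penalize the ratio of peak
# to median activation magnitude at one layer during fine-tuning.
import torch

def me_layer_penalty(hidden_states, me_layer_idx=2, weight=1e-4):
    """hidden_states: per-layer tensors from a forward pass with output_hidden_states=True."""
    h = hidden_states[me_layer_idx].float()            # (batch, seq_len, hidden_dim)
    ratio = h.abs().max() / (h.abs().median() + 1e-6)  # large when a few activations dominate
    return weight * ratio

# total_loss = lm_loss + me_layer_penalty(outputs.hidden_states)
```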