Intra-expert sparsity speeds up MoE expert layers 2.5× without retraining
Researchers find up to 90% sparsity inside individual experts of pretrained Mixture-of-Experts models and achieve 2.5× faster layer execution in vLLM by skipping inactive neurons, with negligible accuracy loss.
A new preprint on arXiv demonstrates that pretrained Mixture-of-Experts models already contain massive sparsity within each expert—up to 90 percent of neurons stay inactive during inference—and that skipping those computations can accelerate serving by a factor of 2.5 at the layer level. The work, which tested eight off-the-shelf MoE models ranging from 1 billion to 400 billion parameters, extends the vLLM execution pipeline to exploit this "intra-expert activation sparsity" on top of the framework's existing optimizations, delivering a 1.2× end-to-end speedup over the dense baseline.
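The gap between the layer-level and end-to-end figures is roughly what a serial-fraction (Amdahl's law) model would predict. As a back-of-the-envelope sanity check, not a number from the paper, and assuming the 2.5× applies uniformly to the expert-FFN portion of serving time:

```python
# Amdahl-style consistency check (illustrative, not from the paper):
# if expert FFNs take a fraction p of inference time and run 2.5x
# faster, end-to-end speedup is 1 / ((1 - p) + p / 2.5). Solving for
# the reported 1.2x end-to-end figure gives the implied p.
layer_speedup = 2.5
end_to_end = 1.2
p = (1 - 1 / end_to_end) / (1 - 1 / layer_speedup)
print(f"implied expert-FFN share of runtime: {p:.0%}")  # ~28%
```

Under that simple model, the expert FFNs would account for roughly a quarter of serving time, with attention, routing, and memory traffic making up the rest.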
MoE architectures have become the default for state-of-the-art large language models because they activate only a subset of experts per token, cutting compute costs while scaling capacity. The conventional view treats each expert as an atomic unit: either it fires or it doesn't. This paper argues that a second dimension of sparsity hides inside the experts themselves. The authors show that the ReLU and GELU activations in pretrained models naturally leave most intermediate neurons at exactly zero (ReLU) or close enough to zero to skip safely (GELU), and that selectively bypassing those neurons during the forward pass yields substantial wall-clock savings without touching weights or activation functions. Importantly, the sparsity appears in models that were never trained with sparsity in mind, sidestepping the expert-collapse and load-imbalance problems that plague attempts to increase inter-expert granularity.
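The phenomenon is straightforward to measure on any open model. The sketch below is illustrative only: a toy randomly initialized expert stands in for a real pretrained one, so it shows the measurement, not the 90% figure (random ReLU weights give roughly 50% zeros; the high sparsity the paper reports emerges from training).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a single MoE expert: a two-layer FFN's up-projection
# plus nonlinearity. Dimensions are illustrative, not from the paper.
d_model, d_ff = 1024, 4096
up = nn.Linear(d_model, d_ff)
act = nn.ReLU()

# A batch of token hidden states routed to this expert.
x = torch.randn(32, d_model)
h = act(up(x))  # intermediate activations, shape (32, d_ff)

# Fraction of intermediate neurons at zero. ReLU zeros are exact; for
# GELU, count values below a small magnitude threshold instead.
threshold = 0.0
sparsity = (h.abs() <= threshold).float().mean().item()
print(f"intra-expert activation sparsity: {sparsity:.1%}")
```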
The vLLM modifications introduced in the paper identify inactive neurons at runtime and bypass the corresponding matrix multiplications, stacking the benefit on top of vLLM's existing kernel fusion and memory optimizations. Across the eight models tested—including both dense-gated and top-k routed architectures—the speedup was consistent, with the largest gains appearing in the widest experts where the absolute number of skipped operations is highest. The authors report negligible accuracy degradation on standard benchmarks, suggesting that the inactive neurons contribute little to the final output and can be pruned dynamically without retraining.
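The paper's kernels aren't reproduced here, but the arithmetic behind the skip is simple: once the up-projection and nonlinearity have run, any neuron whose activation is zero contributes nothing to the down-projection, so its column can be dropped without changing the output. A minimal PyTorch sketch of that idea follows; the negative bias is an artificial device to induce trained-model-like sparsity in random weights and is not part of the method.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, batch = 1024, 4096, 8

W_up = torch.randn(d_ff, d_model) * 0.02    # up-projection
W_down = torch.randn(d_model, d_ff) * 0.02  # down-projection
# Negative bias pushes most pre-activations below zero, mimicking the
# high sparsity of trained experts; real models need no such trick.
b_up = torch.full((d_ff,), -1.5)
x = torch.randn(batch, d_model)

# Dense reference: full up-projection, ReLU, full down-projection.
h = torch.relu(x @ W_up.T + b_up)   # (batch, d_ff)
y_dense = h @ W_down.T              # (batch, d_model)

# Sparse path: keep only neurons active for at least one token in the
# micro-batch, and run the down-projection over those columns alone.
active = (h != 0).any(dim=0).nonzero(as_tuple=True)[0]
y_sparse = h[:, active] @ W_down[:, active].T

print(f"active neurons: {active.numel()}/{d_ff}")
print("max abs error vs dense:", (y_dense - y_sparse).abs().max().item())
```

The reconstruction is exact up to floating-point summation order, since the skipped columns only ever multiply zeros. The engineering challenge, and presumably where the reported 2.5× comes from, is making the index gather and the skinny matmul beat a well-tuned dense kernel on real hardware, which is what the vLLM integration targets.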
The next question is whether training MoE models with explicit intra-expert sparsity targets, via structured pruning, learned masks, or sparsity-aware routing, can push the fraction of active neurons even lower, and whether hardware vendors will add first-class support for fine-grained neuron skipping in their accelerators. For now, practitioners running large MoE models on vLLM can look for the extended pipeline in upcoming releases, and researchers chasing inference efficiency have a new lever to pull, one that doesn't require access to the original training run.
