BEAM cuts MoE layer FLOPs by up to 85% with token-adaptive expert routing
New training method learns binary masks for dynamic expert selection in Mixture-of-Experts models, delivering up to 2.5× faster decoding without architectural overhaul or performance collapse.

BEAM (Binary Expert Activation Masking) is a training method from researchers at Alibaba and Fudan University that replaces fixed Top-K expert routing in Mixture-of-Experts models with learned token-adaptive selection. The approach uses trainable binary masks and a straight-through estimator to induce dynamic sparsity during end-to-end training, sidestepping the train-inference mismatch that causes performance drops in existing acceleration techniques.
Mixture-of-Experts architectures have become a standard efficiency lever in large language models, activating only a subset of experts per token instead of running the full parameter set. But the conventional Top-K routing strategy activates the same number of experts for every token, regardless of whether a given input actually needs that much compute. That rigidity leaves efficiency on the table: some tokens could get away with fewer experts, while others might benefit from more. Existing methods that try to prune experts either demand expensive retraining with new architectures or crater in quality when sparsity climbs, because the model wasn't trained to handle the inference-time expert selection pattern.
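For context, here is a minimal sketch of conventional Top-K routing, not code from the paper; the function name, argument shapes, and default `k=2` are illustrative assumptions. The point is that `k` is a fixed constant, so every token pays the same expert budget.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, router_weight, k=2):
    """Conventional Top-K routing: every token activates exactly k experts."""
    # hidden: [num_tokens, d_model], router_weight: [num_experts, d_model]
    logits = hidden @ router_weight.T                        # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(k, dim=-1)              # the same fixed k for all tokens
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the selected weights
    return expert_ids, weights
```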
BEAM solves both problems by learning the sparsity pattern end-to-end. During training, binary masks decide which experts fire for each token, guided by an auxiliary regularization loss that encourages dynamic selection. The straight-through estimator lets gradients flow through the discrete mask decisions. Because the model trains with the same sparsity it will use at inference, there's no train-test mismatch—and no quality cliff when you crank up the sparsity dial.
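A rough sketch of how a learned binary mask with a straight-through estimator can be wired up is shown below; the class name, the 0.5 threshold, and the mean-activation penalty are assumptions chosen for illustration, not the paper's exact gating function or auxiliary loss.

```python
import torch

class BinaryMaskRouter(torch.nn.Module):
    """Token-adaptive routing sketch: a learned gate produces a per-token,
    per-expert binary mask; the straight-through estimator lets gradients
    pass through the hard 0/1 decision."""

    def __init__(self, d_model, num_experts, sparsity_weight=1e-2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, num_experts)
        self.sparsity_weight = sparsity_weight  # strength of the auxiliary sparsity loss

    def forward(self, hidden):
        soft = torch.sigmoid(self.gate(hidden))   # soft gate in (0, 1), shape [tokens, experts]
        hard = (soft > 0.5).float()               # hard binary mask, as used at inference
        # Straight-through estimator: the forward pass uses `hard`,
        # the backward pass sees the gradient of `soft`.
        mask = hard + soft - soft.detach()
        # Auxiliary regularizer nudging tokens toward activating fewer experts.
        aux_loss = self.sparsity_weight * soft.mean()
        return mask, aux_loss
```

Because the same hard mask is applied during training and at decode time, the expert-selection pattern the model learns is the one it actually runs with, which is the property the article credits for avoiding a quality cliff at high sparsity.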
The team built a custom CUDA kernel that plugs directly into vLLM, the widely deployed open-source inference server. No fork, no architectural surgery. Experiments show BEAM retains over 98% of baseline model performance while cutting MoE layer FLOPs by up to 85%. Decoding speed improves by up to 2.5× and throughput by 1.4× compared to standard Top-K routing. The preprint, authored by Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, and Fuyu Lv, was posted this week. Practitioners running open MoE models—DeepSeek, Qwen-MoE, Mixtral derivatives—can test BEAM without waiting for upstream adoption.