MobileMoE scales sparse experts to smartphones, cuts inference FLOPs by 2–4×

Researchers identify a memory- and compute-optimal Mixture-of-Experts design for smartphones, delivering faster prefill and decode than dense baselines at comparable INT4 memory.

By Alex Sokoloff · May 27, 2026

Mixture-of-Experts architectures have scaled language models to hundreds of billions of parameters, but their utility at sub-billion scales for on-device deployment has remained largely untested—until now.

MobileMoE, a family of on-device MoE language models developed by researchers including Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, and Zechun Liu, operates with 0.3–0.9B active parameters and 1.3–5.3B total parameters. The team formulated an on-device MoE scaling law that jointly optimizes architecture under mobile memory and compute constraints, identifying a sweet spot of moderate sparsity with fine-grained and shared experts that is simultaneously memory- and compute-optimal for smartphones. The preprint, posted to arXiv on May 27, describes a four-stage training recipe—pre-training, mid-training, instruction fine-tuning, and quantization-aware training—all on open-source datasets.
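To make "shared experts" and "sparsity" concrete, here is a minimal sketch of a generic top-k MoE layer in plain Python. It is not MobileMoE's implementation: the function names (`route_token`, `moe_layer`, `scale_by`), the toy experts, and the gate normalisation are all assumptions chosen for illustration. The point is only that shared experts run on every token, while just `top_k` of the routed experts do.

```python
import math

def route_token(router_logits, top_k):
    """Pick the top_k routed experts for one token and normalise their gates."""
    # Indices of the top_k highest-scoring routed experts.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the chosen logits only, so the gates sum to 1 (an assumption).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, shared_experts, routed_experts, router_logits, top_k):
    """Add always-on shared experts to a sparse top_k subset of routed experts."""
    out = [0.0] * len(x)
    for expert in shared_experts:        # shared experts process every token
        out = [o + y for o, y in zip(out, expert(x))]
    gates = route_token(router_logits, top_k)
    for idx, gate in gates.items():      # only the top_k routed experts run
        out = [o + gate * y for o, y in zip(out, routed_experts[idx](x))]
    return out, sorted(gates)

def scale_by(factor):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda x: [factor * v for v in x]

shared = [scale_by(1.0)]
routed = [scale_by(f) for f in (2.0, 3.0, 4.0, 5.0)]
output, active = moe_layer([1.0, -1.0], shared, routed,
                           router_logits=[0.1, 2.0, -1.0, 2.0], top_k=2)
print(active, output)
```

In this toy setup two of the four routed experts run per token, so all five experts must be stored but only three are evaluated. That is the gap between total and active parameters that the article describes.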

Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2–4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE model OLMoE-1B-7B with up to 60% fewer parameters. At comparable INT4 weight memory, MobileMoE-S delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than the dense baseline MobileLLM-Pro. The paper includes the first efficient MoE inference implementation on commodity smartphones with comprehensive on-device profiling, bridging the gap between research and real-world mobile deployment.
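The memory and compute figures can be put in rough perspective with back-of-the-envelope arithmetic. The sketch below assumes 4 bits (0.5 bytes) per weight with no overhead for quantization scales, KV cache, or activations, and uses the common rule of thumb of about 2 FLOPs per active parameter per token. Neither assumption comes from the paper, and the helper names are hypothetical. It is applied only to the endpoints of the parameter ranges quoted above, without assuming which active size pairs with which total size.

```python
def int4_weight_gb(total_params):
    """Weight storage at 4 bits (0.5 bytes) per parameter, in GB (1e9 bytes)."""
    return total_params * 0.5 / 1e9

def approx_gflops_per_token(active_params):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params / 1e9

# Endpoints of the ranges reported in the article.
for total in (1.3e9, 5.3e9):
    print(f"{total / 1e9:.1f}B total params -> ~{int4_weight_gb(total):.2f} GB INT4 weights")
for active in (0.3e9, 0.9e9):
    print(f"{active / 1e9:.1f}B active params -> ~{approx_gflops_per_token(active):.1f} GFLOPs per token")
```

Memory scales with total parameters because every expert must be resident, while per-token compute scales with active parameters. This is why an MoE model can match a dense model's weight footprint yet do less work per token.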
