Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes the MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE model OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones, together with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.
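The parameter ranges quoted above can be turned into rough memory and compute figures. The following is a minimal back-of-the-envelope sketch, not the paper's profiling method. It assumes that INT4 storage costs exactly 4 bits per weight (ignoring quantization scales, zero-points, KV cache, and activations) and that a forward pass costs about 2 FLOPs per active parameter per token, a common rule of thumb. The function names are hypothetical, and only the range endpoints stated in the abstract are used.

```python
def int4_weight_gb(total_params: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: every weight is stored at `bits_per_weight` bits, with no
    overhead for quantization scales or zero-points.
    """
    return total_params * bits_per_weight / 8 / 1e9


def forward_gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass GFLOPs per token.

    Assumption: ~2 FLOPs (one multiply, one add) per active parameter,
    ignoring attention-score and routing costs.
    """
    return 2 * active_params / 1e9


# Range endpoints taken from the abstract; memory scales with total
# parameters, compute scales with active parameters.
TOTAL_PARAMS = (1.3e9, 5.3e9)
ACTIVE_PARAMS = (0.3e9, 0.9e9)

if __name__ == "__main__":
    for total in TOTAL_PARAMS:
        print(f"{total / 1e9:.1f}B total  -> ~{int4_weight_gb(total):.2f} GB INT4 weights")
    for active in ACTIVE_PARAMS:
        print(f"{active / 1e9:.1f}B active -> ~{forward_gflops_per_token(active):.1f} GFLOPs/token")
```

Under these assumptions, weight memory is set by the total parameter count while per-token compute is set by the active count, which is why an MoE model can match a dense model's memory footprint while spending fewer FLOPs per token.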