Mixture-of-experts (MoE) architectures have become a cornerstone for scaling up large language models and are a key component in models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, leaving large-scale studies out of reach for most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry, standardizing evaluation, and providing this comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovation. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
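To make the routing-entropy analysis concrete, here is a minimal PyTorch-style sketch of the kind of quantity involved. It is illustrative only and does not reflect LibMoE's actual API; the function name and shapes are assumptions. Given raw router logits for a batch of tokens, it computes each token's top-k expert assignment and the entropy of its expert distribution, where low entropy indicates a confident, specialized routing decision and high entropy indicates probability spread across many experts.

```python
# Illustrative sketch (not LibMoE's API): per-token routing entropy from router logits.
import torch
import torch.nn.functional as F

def routing_entropy(router_logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the router's expert distribution for each token.

    router_logits: (num_tokens, num_experts) raw router outputs (hypothetical shape).
    Returns a (num_tokens,) tensor: low entropy = confident/specialized routing,
    high entropy = probability spread across many experts.
    """
    probs = F.softmax(router_logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)

# Example: 4 tokens routed over 8 experts with top-2 selection.
logits = torch.randn(4, 8)
entropy = routing_entropy(logits)               # (4,)
topk_vals, topk_idx = logits.topk(k=2, dim=-1)  # experts each token is sent to
print(entropy, topk_idx)
```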