Mixture-of-experts (MoE) is attracting growing attention for its distinctive properties and strong performance, especially on language tasks. By sparsely activating only a subset of parameters for each token, the MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE remains underexplored, and its degree of modularization is unclear. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and report several intriguing observations: (1) neurons act like fine-grained experts; (2) the MoE router usually selects experts with larger output norms; (3) expert diversity increases with layer depth, with the last layer being an outlier. Based on these observations, we also offer suggestions to a broad spectrum of MoE practitioners on topics such as router design and expert allocation. We hope this work sheds light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
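To make "sparsely activating a subset of parameters for each token" concrete, the following is a minimal sketch of top-k routing for a single token. It is not taken from the released code: the names `moe_forward` and `make_expert`, the toy dimensions, the random weights, and the choice to apply softmax over only the selected logits are all assumptions made for illustration, and real models differ in these details.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def make_expert(rng, d_model, d_hidden):
    # Hypothetical expert: a two-layer ReLU feed-forward block with random weights.
    w_in = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_model)
    w_out = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_hidden)
    return lambda x: w_out @ np.maximum(w_in @ x, 0.0)

def moe_forward(token, router_w, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    logits = router_w @ token                    # one router score per expert
    top = np.argsort(logits)[-k:][::-1]          # indices of the k highest scores
    gates = softmax(logits[top])                 # assumption: normalize over the chosen experts only
    outputs = [experts[i](token) for i in top]   # only k of the experts are evaluated
    mixed = sum(g * o for g, o in zip(gates, outputs))
    return mixed, top, gates

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4          # toy sizes, chosen arbitrarily
router_w = rng.standard_normal((n_experts, d_model))
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]
token = rng.standard_normal(d_model)

mixed, top, gates = moe_forward(token, router_w, experts, k=2)

# Output norm of every expert on this token, for comparison with the router's choice.
norms = [float(np.linalg.norm(expert(token))) for expert in experts]
print("selected experts:", top, "gates:", gates)
print("output norm per expert:", norms)
```

With random weights there is no reason for the selected experts to have the largest output norms; the last two lines only show how one might compare the router's choice against per-expert output norms on a trained model.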