Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: \textbf{dynamic structural pruning}, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and \textbf{downcycling}, which uses pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained, neuron-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy that yields superior MoE initialization.
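The partitioning idea can be illustrated with a toy sketch: a GLU feed-forward layer gates each intermediate neuron per token, so measuring how often each neuron's gate fires over a batch separates consistently active neurons (shared-expert candidates) from sparsely active ones (routed-expert candidates). The thresholds, the round-robin split, and all array shapes below are illustrative assumptions, not ExpertWeaver's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    # SiLU gate nonlinearity used in GLU-style FFNs
    return x * (1.0 / (1.0 + np.exp(-x)))

# Toy GLU FFN: d_model inputs, d_ff intermediate neurons (sizes are arbitrary).
d_model, d_ff, n_tokens = 8, 32, 256
W_gate = rng.normal(size=(d_model, d_ff))
X = rng.normal(size=(n_tokens, d_model))

# Per-neuron activation frequency: fraction of tokens on which the
# gate output exceeds a small magnitude threshold (neuron "fires").
gate_act = silu(X @ W_gate)
act_freq = (np.abs(gate_act) > 0.1).mean(axis=0)   # shape (d_ff,)

# Partition (illustrative thresholds): consistently firing neurons
# form a shared expert; the rest are split into routed experts.
shared_idx = np.flatnonzero(act_freq > 0.9)
routed_idx = np.flatnonzero(act_freq <= 0.9)
n_experts = 4
routed_experts = [routed_idx[i::n_experts] for i in range(n_experts)]

print(f"shared neurons: {len(shared_idx)}, routed neurons: {len(routed_idx)}")
```

In a real conversion, the selected index sets would slice the dense layer's gate/up/down weight matrices to instantiate each expert, so no parameters are retrained.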