Mixture-of-Experts (MoE) language models reduce per-token computation through sparse expert activation, yet deployment still requires storing the full expert pool, making one-shot expert pruning a practical approach for reducing memory usage. Although effective, existing criteria are largely heuristic, and no single criterion is universally optimal. Thus, establishing a principle for selecting pruning criteria suited to different deployment objectives remains an important yet largely underexplored problem in one-shot expert pruning. To this end, we introduce a unified formulation for one-shot MoE expert pruning organized around three factors: routing frequency, gate weighting, and activation strength. The formulation yields a criteria selection principle: task-agnostic pruning should favor routed-token-averaged, gate-free activation-based criteria, whereas task-specific pruning can benefit from retaining routing-frequency and gate-weight information. Beyond this principle, the formulation also provides a systematic view of existing heuristic criteria and gives rise to two new task-agnostic criteria, Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN). Across four representative MoE models and 16 diverse benchmarks, MAN and MSAN are consistently strong in the task-agnostic setting, obtain the top-two average ranks, and improve average performance by up to 8.8 points over the strongest baseline.
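As an illustration of what a routed-token-averaged, gate-free activation criterion could look like, the following minimal sketch scores each expert from the outputs it produced on calibration tokens. The abstract does not give the exact definitions of MAN and MSAN, so this sketch assumes MAN is the mean L2 norm of an expert's output over the tokens routed to it and MSAN is the mean squared L2 norm. It also assumes that the lowest-scoring experts are the ones pruned and that an expert that received no tokens scores zero. The names `expert_scores`, `experts_to_prune`, and `routed_outputs` are hypothetical.

```python
import math


def expert_scores(routed_outputs):
    """Score experts from their outputs on calibration tokens.

    routed_outputs maps expert_id -> list of output vectors, one per token
    routed to that expert. The outputs are assumed to be taken before any
    gate weight is applied (gate-free). Returns expert_id -> (man, msan).
    """
    scores = {}
    for expert_id, outputs in routed_outputs.items():
        if not outputs:
            # Assumption: an expert that was never routed to scores zero.
            scores[expert_id] = (0.0, 0.0)
            continue
        sq_norms = [sum(v * v for v in vec) for vec in outputs]
        # Averaging over routed tokens removes the routing-frequency factor.
        man = sum(math.sqrt(s) for s in sq_norms) / len(outputs)
        msan = sum(sq_norms) / len(outputs)
        scores[expert_id] = (man, msan)
    return scores


def experts_to_prune(scores, num_prune, use_squared=False):
    """Return the num_prune experts with the lowest MAN (or MSAN) score."""
    idx = 1 if use_squared else 0
    ranked = sorted(scores, key=lambda e: scores[e][idx])
    return ranked[:num_prune]


# Toy calibration data: expert 1 is routed to most often but has weak outputs.
demo = {
    0: [[3, 4]],
    1: [[1, 0], [0, 1], [1, 0]],
    2: [[0, 2], [0, 2]],
}
scores = expert_scores(demo)
pruned = experts_to_prune(scores, num_prune=1)
```

In this toy input, expert 1 receives the most tokens yet has the smallest average activation norm, so it is the one selected for pruning. A frequency-based criterion would instead favor keeping it, which is the contrast the selection principle draws between task-agnostic and task-specific pruning.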