Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token computation, yet deployment still requires storing the full expert pool, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert-pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, making pruning decisions sensitive to calibration-data variation while introducing substantial preprocessing cost. We propose AIMER (\textbf{A}bsolute mean over root mean square \textbf{IM}portance for \textbf{E}xpert \textbf{R}anking), a simple calibration-free criterion that identifies more distinct experts by capturing the concentration pattern of expert weights, making it well suited for task-agnostic expert pruning. Across 7B to 47B MoE language models with distinct architectures and 16 diverse benchmarks, AIMER consistently delivers stronger capability balance across diverse tasks than existing calibration-free methods. Surprisingly, AIMER also achieves better balance than strong calibration-based expert-pruning baselines calibrated on the widely used task-agnostic C4 corpus, while requiring only 0.22--2.06 seconds to score all experts.
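As a rough illustration of what the criterion's name describes, the sketch below computes the ratio of the mean absolute value to the root mean square of a weight vector. This is a minimal sketch under assumptions, not the authors' implementation: the function name `abs_mean_over_rms`, the toy experts, and the treatment of each expert as a single flat list of weights are all hypothetical, and the abstract does not state how scores are aggregated across an expert's matrices or in which direction experts are ranked.

```python
import math


def abs_mean_over_rms(weights):
    """Return mean(|w|) / sqrt(mean(w^2)) for a flat list of weights."""
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    abs_mean = sum(abs(w) for w in weights) / n
    rms = math.sqrt(sum(w * w for w in weights) / n)
    if rms == 0.0:
        raise ValueError("ratio is undefined for all-zero weights")
    return abs_mean / rms


# Hypothetical toy experts: flat lists standing in for real weight matrices.
experts = {
    "uniform": [0.5, -0.5, 0.5, -0.5],  # all magnitudes equal
    "peaked": [2.0, 0.0, 0.0, 0.0],     # one dominant weight
}

# Scoring needs only the weights themselves, no calibration data.
scores = {name: abs_mean_over_rms(w) for name, w in experts.items()}
```

By the Cauchy-Schwarz inequality this ratio lies between 1/sqrt(n) and 1 for n weights: it equals 1 when all magnitudes are equal and approaches 1/sqrt(n) as the mass concentrates in a single entry, so a lower value indicates a more concentrated weight distribution.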