Mixture-of-experts (MoE) models reduce inference costs in large language models (LLMs) by sparsely activating experts. Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this by pruning MoEs. Among pruning methodologies, unstructured pruning is known to achieve the highest performance for a given pruning ratio, since structured pruning imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. Counterintuitively, however, we find that expert pruning, a form of structured pruning, can precede unstructured pruning and thereby outperform unstructured-only pruning. Because existing expert-pruning methods require $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts and thus cannot scale to recent MoEs, we propose a scalable alternative with $O(1)$ complexity that nonetheless outperforms these more expensive methods. The key idea is to exploit a latent structure among experts, based on behavioral similarity, so that the greedy decision of whether to prune each expert closely captures the joint pruning effect. Our method is highly effective: for Snowflake Arctic, a 480B-parameter MoE with 128 experts, it needs only a single H100 GPU and two hours to reach 40% sparsity with nearly no loss in performance, even on generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails. The code will be made publicly available.
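The abstract's core idea, greedily pruning experts whose behavior is well covered by the remaining experts, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual algorithm: the interface (`expert_outputs` as a matrix of per-expert responses on a calibration set) and the cosine-similarity criterion are hypothetical stand-ins for whatever behavioral-similarity measure the method actually uses.

```python
import numpy as np

def greedy_expert_prune(expert_outputs, num_keep):
    """Greedily drop experts most similar to another kept expert.

    expert_outputs: (n_experts, n_features) array, each row a hypothetical
    summary of one expert's behavior on calibration data.
    Returns the indices of the experts kept.
    """
    n = expert_outputs.shape[0]
    # Cosine similarity between expert behavior vectors.
    norms = np.linalg.norm(expert_outputs, axis=1, keepdims=True)
    unit = expert_outputs / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    kept = list(range(n))
    while len(kept) > num_keep:
        # Restrict the similarity matrix to the surviving experts and
        # ignore each expert's similarity to itself.
        sub = sim[np.ix_(kept, kept)]
        np.fill_diagonal(sub, -np.inf)
        # Prune the expert whose behavior is most redundant, i.e. the one
        # with the highest similarity to some other kept expert.
        worst = int(np.argmax(sub.max(axis=1)))
        kept.pop(worst)
    return kept
```

With three toy experts where the first and third behave identically, keeping two experts drops one of the duplicates while the distinct expert survives; this illustrates why a greedy per-expert decision over a similarity structure can approximate the joint pruning effect cheaply.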