A pivotal advancement in large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but they remain hard to deploy because of their immense parameter sizes. Unlike previous weight pruning methods that rely on specifically designed hardware, this paper aims to improve the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that the proposed methods simultaneously reduce model size and increase inference speed while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.
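To make the two ideas named in the abstract concrete, the following is a minimal sketch of how expert pruning and expert skipping could interact with a top-k router for a single token. It is an illustration under stated assumptions, not the authors' implementation: the function name `route_top_k`, the `kept_experts` list, and the `skip_ratio` rule (drop an expert whose routing weight falls below a fraction of the top weight) are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, kept_experts, k=2, skip_ratio=0.0):
    """Select experts for one token (hypothetical helper, for illustration).

    router_logits: one router score per original expert.
    kept_experts:  indices of experts that survived pruning (assumed given).
    k:             number of experts the router normally activates.
    skip_ratio:    assumed skipping rule; an expert is skipped when its
                   weight is below skip_ratio times the top expert's weight.
    Returns a list of (expert_index, normalized_weight) pairs.
    """
    if not kept_experts:
        raise ValueError("at least one expert must be kept")

    # Expert pruning: removed experts are never candidates, so their
    # weights need not be stored or loaded at all.
    candidates = sorted(
        kept_experts, key=lambda e: router_logits[e], reverse=True
    )[:k]
    weights = softmax([router_logits[e] for e in candidates])

    # Expert skipping: at inference time, drop low-weight experts for this
    # token so that fewer expert forward passes are executed.
    top_weight = weights[0]
    selected = [
        (e, w) for e, w in zip(candidates, weights)
        if w >= skip_ratio * top_weight
    ]

    # Renormalize so the remaining weights sum to one.
    norm = sum(w for _, w in selected)
    return [(e, w / norm) for e, w in selected]


if __name__ == "__main__":
    logits = [2.0, 0.1, 1.9, -3.0]
    # All four experts kept: experts 0 and 2 have similar scores, both run.
    print(route_top_k(logits, [0, 1, 2, 3], k=2, skip_ratio=0.5))
    # Expert 2 pruned: expert 1 becomes the runner-up but is skipped
    # because its weight is far below that of expert 0.
    print(route_top_k(logits, [0, 1, 3], k=2, skip_ratio=0.5))
```

In this sketch, pruning reduces memory because removed experts are dropped for every token, whereas skipping reduces compute per token while leaving the stored weights untouched. How the kept experts are chosen, and what skipping criterion is used, are exactly the design questions the paper addresses and are not specified here.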