Mixture-of-Experts (MoE) large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of the experts. Existing compression approaches either remove entire experts, which disrupts the routing topology and harms performance, or rely on unstructured weight pruning, which offers limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts while applying expert neuron pruning (ENP) to less important experts, retaining model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the pruned DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.
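To make the two scoring ideas concrete, the following is a minimal NumPy sketch, not the TENP implementation. The abstract does not give formulas, so everything specific here is an assumption: the gated SiLU expert layout, the product used to combine output magnitude with direction change (one minus cosine similarity), the absolute mean over samples, and all function names (`expert_importance`, `neuron_scores`, `prune_expert`) are hypothetical. The trapezoidal layer-wise schedule is not modeled.

```python
import numpy as np

def silu(x):
    # SiLU activation; the gated-FFN expert layout is an assumption.
    return x / (1.0 + np.exp(-x))

def expert_hidden(x, w_gate, w_up):
    # x: (n, d) inputs; w_gate, w_up: (d, m). Returns neuron activations (n, m).
    return silu(x @ w_gate) * (x @ w_up)

def expert_importance(x, y, eps=1e-8):
    # Hypothetical score: output magnitude times direction change,
    # where direction change is 1 - cos(x, y). The product is an assumed
    # way of "jointly considering" the two quantities.
    mag = np.linalg.norm(y, axis=1)
    cos = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * mag + eps)
    return float(np.mean(mag * (1.0 - cos)))

def neuron_scores(h, w_down, eps=1e-8):
    # h: (n, m) activations; w_down: (m, d). Neuron j contributes
    # h[:, j] * w_down[j] to the output y = h @ w_down. Each contribution
    # is projected onto the unit output direction, then averaged in
    # absolute value over samples (the averaging is an assumption).
    y = h @ w_down
    y_unit = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    proj = h * (y_unit @ w_down.T)  # (n, m) projected contributions
    return np.abs(proj).mean(axis=0)

def prune_expert(w_gate, w_up, w_down, scores, keep):
    # Structured pruning: keep the `keep` highest-scoring neurons by
    # slicing matching columns of w_gate/w_up and rows of w_down.
    idx = np.sort(np.argsort(scores)[-keep:])
    return w_gate[:, idx], w_up[:, idx], w_down[idx, :], idx
```

Because whole neurons are removed, the pruned expert remains a smaller dense feed-forward block, which is what distinguishes this kind of structured pruning from unstructured weight masking.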