Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest of these models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reducing memory requirements include weight pruning and quantization. Motivated by Router-weighted Expert Activation Pruning (REAP), which prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving the original model's performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math, and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.