Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) under constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture and make three main contributions: (1) We investigate the dispersion of activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across tasks. (2) We propose Expert-Specialized Fine-Tuning (ESFT), which tunes the experts most relevant to a downstream task while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning and find that MoE models with finer-grained experts are better at selecting the combination of experts most relevant to a downstream task, which improves both training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.
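As a rough illustration of the expert-selection idea in contribution (2), the Python sketch below ranks experts by how often the router picks them on task data, keeps the smallest set that covers a chosen share of routing decisions, and marks only those experts as trainable. The abstract does not state the actual relevance score or selection rule, so the routing-share score, the cumulative threshold, and all function names here are assumptions made for illustration; they are not taken from the ESFT repository.

```python
from collections import Counter


def expert_shares(routed_expert_ids, num_experts):
    """Fraction of routing slots assigned to each expert on sampled task data.

    routed_expert_ids: one list of selected expert ids per token
    (hypothetical input format, e.g. collected from one MoE layer).
    """
    counts = Counter(e for token in routed_expert_ids for e in token)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing decisions provided")
    return [counts.get(e, 0) / total for e in range(num_experts)]


def select_experts(shares, threshold):
    """Smallest set of top-ranked experts whose cumulative share reaches threshold.

    The cumulative-threshold rule is an assumed selection criterion.
    """
    ranked = sorted(range(len(shares)), key=lambda e: shares[e], reverse=True)
    selected, cumulative = [], 0.0
    for e in ranked:
        if cumulative >= threshold:
            break
        selected.append(e)
        cumulative += shares[e]
    return sorted(selected)


def trainable_mask(num_experts, selected):
    """True for experts to fine-tune, False for experts to keep frozen."""
    chosen = set(selected)
    return [e in chosen for e in range(num_experts)]


# Toy routing trace: 6 tokens, top-2 routing over 8 experts.
routing = [[0, 3], [0, 3], [0, 5], [3, 0], [0, 1], [3, 0]]
shares = expert_shares(routing, num_experts=8)
selected = select_experts(shares, threshold=0.8)
mask = trainable_mask(8, selected)
print(selected)  # [0, 3]
```

In this toy trace, two of eight experts receive over 80% of the routing decisions, which mirrors the concentrated per-task routing described in contribution (1). In a real training setup, the mask would decide which expert parameters receive gradient updates, with all remaining experts and modules left frozen.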