Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this observation also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, \textit{\textbf{M}oE-L\textbf{o}RA with \textbf{L}ayer-wise Expert \textbf{A}llocation (MoLA)} for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that, for a fixed total number of experts, allocating more LoRA experts to higher layers further improves model effectiveness; this allocation strategy outperforms the setting with the same number of experts in every layer while using far fewer parameters. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.
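To make the layer-wise allocation idea concrete, below is a minimal sketch (not the authors' implementation) of a frozen linear layer augmented with a router over a layer-specific number of LoRA experts. The class names (`LoRAExpert`, `MoLALinear`), the top-k routing choice, and the example allocation list are illustrative assumptions; the released code at the repository above is the reference implementation.

```python
# Hypothetical sketch of layer-wise LoRA-expert allocation (MoLA-style).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> (alpha / r) * B(A(x)), with B initialized to zero."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the adapter starts as a zero update
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.lora_B(self.lora_A(x)) * self.scaling


class MoLALinear(nn.Module):
    """Frozen base linear layer plus a token-level router over LoRA experts.

    `num_experts` can differ per Transformer layer, which is the layer-wise
    allocation described in the abstract.
    """

    def __init__(self, base: nn.Linear, num_experts: int, top_k: int = 2, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only experts and router are trained
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, r) for _ in range(num_experts)
        )
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.top_k = min(top_k, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        gate_logits = self.router(x)                         # (..., num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # route each token to top-k experts
        weights = F.softmax(weights, dim=-1)
        for e, expert in enumerate(self.experts):
            # combined routing weight of expert e over the selected top-k slots
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)
            out = out + w * expert(x)
        return out


# Example allocation for a 32-layer model: fewer experts in lower layers,
# more in higher layers (illustrative numbers only).
allocation = [2] * 8 + [2] * 8 + [4] * 8 + [8] * 8
```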