Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be used to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) direct finetuning on individual downstream tasks without instruction tuning; (ii) instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This picture changes dramatically once instruction tuning is introduced (second and third scenarios), whether used on its own or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.