Despite many recent works on Mixture of Experts (MoE) for resource-efficient Transformer language models, existing methods mostly focus on MoE for feedforward layers. Previous attempts at extending MoE to the self-attention layer failed to match the performance of parameter-matched baselines. We propose SwitchHead, a novel and effective MoE method for the attention layer that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, yielding fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of the standard model with only 44% of the compute and 27% of the memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving a more than 3.5% absolute improvement on BLiMP over the baseline with equal compute.
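The abstract does not spell out the routing details, but the general idea of per-token expert selection for attention projections can be sketched as follows. This is a minimal illustration, not the paper's exact method: the sigmoid (non-competitive) gating, the top-k selection, and all shapes and names here are assumptions for exposition. Because each token computes only k of the E expert projections, compute scales with k rather than E.

```python
import numpy as np

def switch_route(x, W_gate, k=2):
    # x: (T, d) token representations; W_gate: (d, E) gating weights.
    # Non-competitive sigmoid gating (an assumption for illustration):
    # each token independently selects its top-k experts by gate score.
    scores = 1.0 / (1.0 + np.exp(-x @ W_gate))      # (T, E), in (0, 1)
    topk = np.argsort(-scores, axis=-1)[:, :k]      # (T, k) chosen expert ids
    return scores, topk

def moe_projection(x, experts, scores, topk):
    # experts: (E, d, d_head) per-expert projection matrices.
    # Each token's output is the gate-weighted sum of its k selected
    # expert projections; the other E - k projections are never computed.
    T, _ = x.shape
    _, _, d_head = experts.shape
    out = np.zeros((T, d_head))
    for t in range(T):
        for e in topk[t]:
            out[t] += scores[t, e] * (x[t] @ experts[e])
    return out

# Toy usage with hypothetical sizes: 4 tokens, 4 experts, 2 active per token.
rng = np.random.default_rng(0)
T, d, d_head, E, k = 4, 8, 4, 4, 2
x = rng.standard_normal((T, d))
scores, topk = switch_route(x, rng.standard_normal((d, E)), k)
v = moe_projection(x, rng.standard_normal((E, d, d_head)), scores, topk)
```

In a full MoE attention layer, such routed projections would replace the dense per-head projections, which is where the reduction in computed attention matrices comes from.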