How can we reduce the compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods for approximating two-layer NNs (e.g., the feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, we evaluate under the parameter-equal condition, which is crucial for properly evaluating LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to resource-efficient LMs at any scale. Our code is public.
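The idea of approximating a two-layer feedforward block with a sparse MoE can be illustrated with a minimal sketch. It is not the method proposed in this work: it assumes the hidden units are split into equal-sized contiguous groups ("experts"), that a hypothetical linear router `w_router` scores each group, and that the top-`k` groups are summed without any gate weighting. All function and parameter names are illustrative.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(m, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def dense_ffn(w1, w2, x):
    # Dense two-layer block: y = W2 relu(W1 x).
    # w1 has shape d_ff x d_model, w2 has shape d_model x d_ff.
    return matvec(w2, relu(matvec(w1, x)))


def expert_output(w1, w2, x, units):
    # Contribution of one group of hidden units (one "expert") to the output.
    h = relu(matvec([w1[u] for u in units], x))
    return [sum(row[u] * hu for u, hu in zip(units, h)) for row in w2]


def sparse_moe_ffn(w1, w2, w_router, x, n_experts, k):
    # Assumption: d_ff is divisible by n_experts and experts are contiguous.
    size = len(w1) // n_experts
    # Hypothetical linear router: one score per expert.
    scores = matvec(w_router, x)
    top = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:k]
    y = [0.0] * len(w2)
    for e in top:
        units = range(e * size, (e + 1) * size)
        contrib = expert_output(w1, w2, x, units)
        y = [a + b for a, b in zip(y, contrib)]
    return y, top


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Small demo with random weights (values are arbitrary, for illustration only).
rng = random.Random(0)
d_model, d_ff, n_experts = 4, 8, 4
w1 = random_matrix(rng, d_ff, d_model)
w2 = random_matrix(rng, d_model, d_ff)
w_router = random_matrix(rng, n_experts, d_model)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

y_dense = dense_ffn(w1, w2, x)
y_sparse, chosen = sparse_moe_ffn(w1, w2, w_router, x, n_experts, k=2)
```

With `k` equal to `n_experts`, this sketch reproduces the dense output up to floating-point rounding. With smaller `k`, only the selected groups of hidden units are evaluated, which is where the compute savings come from in this simplified view.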