In standard transformer architectures, the computational cost and activation memory of the feedforward (FFW) layers grow linearly with the hidden layer width. Sparse mixture-of-experts (MoE) architectures have emerged as a viable way to address this issue by decoupling model size from computational cost. The recently discovered fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that uses the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs on the performance-compute trade-off. By enabling efficient use of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
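To make the product key technique concrete, the following is a minimal sketch of sparse top-k retrieval over a Cartesian product of two sub-key sets. It is an illustration under stated assumptions, not the paper's implementation: the function names, the pure-Python vectors, and the flat index `i * n_b + j` are all hypothetical choices made here, and the sketch covers only the retrieval step, not the experts themselves.

```python
import heapq
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Return (score, index) pairs for the k largest scores, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def product_key_top_k(query, sub_keys_a, sub_keys_b, k):
    """Top-k over all n_a * n_b product keys without scoring every key.

    Each full key is the concatenation of one sub-key from each set, so its
    score is the sum of two half-query scores. Any key in the overall top-k
    must use a sub-key in the top-k of each half, so k * k candidates suffice.
    Assumes k <= len(sub_keys_a) and k <= len(sub_keys_b).
    """
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]
    best_a = top_k([dot(q_a, c) for c in sub_keys_a], k)
    best_b = top_k([dot(q_b, c) for c in sub_keys_b], k)
    n_b = len(sub_keys_b)
    # Hypothetical flat expert index: i * n_b + j for sub-key pair (i, j).
    candidates = [(s_a + s_b, i * n_b + j)
                  for s_a, i in best_a for s_b, j in best_b]
    return heapq.nlargest(k, candidates)


def brute_force_top_k(query, sub_keys_a, sub_keys_b, k):
    """Reference implementation that scores every product key."""
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]
    n_b = len(sub_keys_b)
    scores = [(dot(q_a, c_a) + dot(q_b, c_b), i * n_b + j)
              for i, c_a in enumerate(sub_keys_a)
              for j, c_b in enumerate(sub_keys_b)]
    return heapq.nlargest(k, scores)


def random_vectors(n, dim, rng):
    """Illustrative random sub-keys; a trained layer would learn these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
```

As an illustration of the savings, with 1024 sub-keys in each set the product addresses 1024 × 1024 = 1,048,576 keys, yet a query only scores 2 × 1024 sub-keys and then k × k candidate pairs, instead of more than a million full keys.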