Large and sparse feed-forward networks (S-FFN), such as Mixture-of-Experts (MoE), have proven to be an efficient approach for scaling up Transformer model size when pretraining large language models. By activating only part of the FFN parameters, conditioned on the input, S-FFN improves generalization performance while keeping training and inference costs (in FLOPs) fixed. In this work, we analyze two major design choices of S-FFN, the memory block (or expert) size and the memory block selection method, under a general conceptual framework of sparse neural memory. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We find that a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieves lower perplexity in language modeling pretraining than existing MoE architectures.
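To make the block-selection idea concrete, the following is a minimal sketch of one possible reading of Avg-K, not the authors' implementation. It assumes the FFN is viewed as a key-value memory whose cells are grouped into contiguous blocks of equal size, that a block's score is the mean of its pre-activation hidden states (which, by linearity, equals the dot product of the input with the block's mean key), and that ReLU is the activation. The names `block_mean_keys`, `avg_k_ffn`, `block_size`, and `top_b` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def block_mean_keys(keys, block_size):
    """Average the key vectors inside each contiguous block (assumed layout)."""
    dim = len(keys[0])
    means = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        means.append([sum(k[i] for k in block) / len(block) for i in range(dim)])
    return means


def avg_k_ffn(x, keys, values, block_size, top_b):
    """Sparse FFN forward pass for a single input vector x.

    Assumption: a block's score is x . mean_key, i.e. the mean of the
    block's pre-activation hidden states. Only the top_b highest-scoring
    blocks are evaluated.
    """
    means = block_mean_keys(keys, block_size)
    scores = [dot(x, m) for m in means]
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    selected = sorted(ranked[:top_b])

    out = [0.0] * len(values[0])
    for b in selected:
        stop = min((b + 1) * block_size, len(keys))
        for j in range(b * block_size, stop):
            h = max(0.0, dot(x, keys[j]))  # ReLU is an assumption
            for i, v in enumerate(values[j]):
                out[i] += h * v
    return out, selected


# Toy example: 4 memory cells in 2 blocks of size 2.
keys = [[1, 0], [1, 0], [0, 1], [0, 1]]
values = [[1, 0], [0, 1], [2, 0], [0, 2]]
output, chosen = avg_k_ffn([3, 1], keys, values, block_size=2, top_b=1)
print(chosen, output)
```

In this sketch the mean keys are recomputed on every call for brevity; in practice they would be cached, so that scoring costs one dot product per block rather than one per memory cell.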