Pre-trained language models have been shown to possess strong base capabilities: they not only excel at in-distribution language modeling but also exhibit powerful out-of-distribution language modeling, transfer learning, and few-shot learning abilities. Unlike existing work that focuses on the influence of scale on base capabilities, our work examines the influence of architecture. Specifically, we ask: how does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found that the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities. FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. We confirmed this experimentally and proposed a Combination Enhanced Architecture (CEA) to address the decline in base capabilities of such models. Notably, we extended our explanation and CEA to Mixture of Experts (MoE) Transformers, achieving significant improvements in base capabilities on a 14B-parameter MoE model and demonstrating the practical value of our work. This also indicates that our analysis offers guidance for architecture analysis, architecture improvement, and architecture design.
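The abstract does not specify how the contribution ratio of Multi-Head Attention is measured, so the following is a minimal illustrative sketch only, assuming a simple norm-based proxy: the share of each layer's residual update that comes from the attention branch rather than the FFN branch. All names here (ToyBlock, attn_contribution_ratio, width_multiplier) are hypothetical and not from the paper.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A pre-LN Transformer block with a configurable FFN width."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, width_multiplier: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, width_multiplier * d_model),
            nn.GELU(),
            nn.Linear(width_multiplier * d_model, d_model),
        )

    def forward(self, x):
        # Keep the two residual-branch outputs so their sizes can be compared.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        f = self.ffn(self.ln2(x))
        x = x + f
        return x, a, f

def attn_contribution_ratio(a: torch.Tensor, f: torch.Tensor) -> float:
    """Assumed proxy: attention's share of the total residual-update norm."""
    na, nf = a.norm(), f.norm()
    return (na / (na + nf)).item()

x = torch.randn(2, 16, 64)  # (batch, sequence, d_model)
for mult in (4, 16):  # a standard FFN vs. a "wider" FFN
    _, a, f = ToyBlock(width_multiplier=mult)(x)
    print(f"FFN width x{mult}: attention contribution = {attn_contribution_ratio(a, f):.3f}")
```

Under this toy proxy, widening the FFN tends to shrink the attention branch's relative share of the residual update, which mirrors (but does not reproduce) the reduced combination-function contribution that the abstract identifies in FFN-Wider Transformers.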