For years, model performance in machine learning has obeyed a power-law relationship with model size. For the sake of parameter efficiency, recent studies have focused on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer model through a parameter-efficient multi-path structure. To better fuse the features extracted from different paths, we add three operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighting mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model. These results indicate that the multi-path structure deserves more attention, and that model depth and width should be balanced to train a better large-scale Transformer.
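The three operations named in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not specify the details, so everything below is an assumption made for illustration: each path is a single square linear map, the "cheap operation" is taken to be the element-wise mean of the normalized path outputs, the fusion weights are a softmax over one scalar logit per feature, and a residual connection wraps the sublayer. The names `multi_path_sublayer`, `path_weights`, and `fusion_logits` are hypothetical and do not come from the paper.

```python
import math
import random


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def linear(weight, v):
    """Apply a dense matrix, given as a list of rows, to a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_path_sublayer(x, path_weights, fusion_logits):
    """Illustrative multi-path sublayer (assumed design, not the paper's code).

    x             : input vector of size d
    path_weights  : one d x d matrix per path (hypothetical stand-in for a
                    real attention or feed-forward path)
    fusion_logits : one scalar per fused feature, i.e. len(path_weights) + 1
    """
    # 1) Run every path and normalize at the end of each path.
    path_feats = [layer_norm(linear(w, x)) for w in path_weights]

    # 2) Cheap operation (assumption): derive one extra feature as the
    #    element-wise mean of the path outputs, adding no parameters.
    n = len(path_feats)
    cheap = [sum(col) / n for col in zip(*path_feats)]
    feats = path_feats + [cheap]

    # 3) Learnable weighted fusion (assumption): softmax over scalar logits,
    #    then a weighted sum of all features.
    alphas = softmax(fusion_logits)
    fused = [
        sum(a * f[j] for a, f in zip(alphas, feats))
        for j in range(len(feats[0]))
    ]

    # Residual connection; requires the paths to preserve the input size.
    return [xi + fi for xi, fi in zip(x, fused)]


# Usage: two randomly initialized paths over a 4-dimensional input.
random.seed(0)
d, num_paths = 4, 2
paths = [
    [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    for _ in range(num_paths)
]
logits = [0.0] * (num_paths + 1)  # uniform fusion weights at initialization
y = multi_path_sublayer([1.0, 2.0, 3.0, 4.0], paths, logits)
print(y)
```

In a trained model the fusion logits and path parameters would be learned jointly. Starting the logits at zero simply gives every feature equal weight, which is one plausible initialization rather than a documented choice.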