Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the RPE's exponential converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext-103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to Empirical Receptive Field (ERF) across different models, showing consistently matched trends on the aforementioned datasets. The code is available at https://github.com/OpenNLPLab/Rpe.
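To make the convergence condition concrete, the following minimal sketch computes partial sums of the series of RPE exponentials for three bias shapes. It rests on assumptions not stated in the abstract: the RPE is modeled as an additive bias b(i) on the attention logits that depends only on the relative distance i, and the series in question is taken to be the sum of exp(b(i)) over i ≥ 1. The function names and the three bias shapes are hypothetical illustrations, not the paper's implementation or the RPEs it evaluates.

```python
import math


def partial_sum(bias, n):
    """Return sum of exp(bias(i)) over relative distances i = 1..n."""
    return sum(math.exp(bias(i)) for i in range(1, n + 1))


# Hypothetical bias shapes (assumptions for illustration only).
def linear_bias(i, slope=0.5):
    # Linear decay: exp(b(i)) is a geometric sequence, so the series converges.
    return -slope * i


def log_bias_p2(i):
    # exp(b(i)) = 1 / i**2, a convergent p-series (p = 2).
    return -2.0 * math.log(i)


def log_bias_p1(i):
    # exp(b(i)) = 1 / i, the harmonic series, which diverges.
    return -1.0 * math.log(i)


if __name__ == "__main__":
    lengths = (10**2, 10**3, 10**4, 10**5)
    for name, bias in [
        ("linear", linear_bias),
        ("log, p=2", log_bias_p2),
        ("log, p=1", log_bias_p1),
    ]:
        sums = [partial_sum(bias, n) for n in lengths]
        print(name, [round(s, 4) for s in sums])
```

Under these assumptions, the partial sums for the first two shapes level off as the length grows, while those for the third keep increasing by roughly ln 10 for every tenfold increase in length. Only the first two would satisfy a convergence condition of the kind described above.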