Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the RPE's exponential converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext-103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to Empirical Receptive Field (ERF) across different models, showing consistently matched trends on the aforementioned datasets. The code is available at https://github.com/OpenNLPLab/Rpe.
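The convergence condition stated above can be illustrated numerically. The sketch below is not taken from the paper's code. It assumes the condition reads as convergence of the series sum_{i>=1} exp(b_i), where b_i is the bias the RPE adds to the attention logit at relative distance i; the abstract does not spell out this notation. The three bias functions (`linear_bias`, `log_bias`, `slow_log_bias`) and the helper `partial_sums` are hypothetical names chosen for illustration.

```python
import math

def partial_sums(bias, checkpoints):
    """Partial sums of sum_{i=1}^{n} exp(bias(i)) at each n in checkpoints.

    `checkpoints` must be sorted in increasing order.
    """
    sums, total, i = [], 0.0, 0
    for n in checkpoints:
        while i < n:
            i += 1
            total += math.exp(bias(i))
        sums.append(total)
    return sums

# Hypothetical example biases (assumptions, not the paper's definitions).
def linear_bias(i, slope=0.5):
    # Linear decay: exp(b_i) = exp(-slope * i), a geometric series.
    return -slope * i

def log_bias(i, power=2.0):
    # Logarithmic decay: exp(b_i) = i ** (-power), converges for power > 1.
    return -power * math.log(i)

def slow_log_bias(i):
    # exp(b_i) = 1 / i, the harmonic series, which diverges.
    return -math.log(i)

checkpoints = [10**2, 10**3, 10**4, 10**5]
linear = partial_sums(linear_bias, checkpoints)
quadratic = partial_sums(log_bias, checkpoints)
harmonic = partial_sums(slow_log_bias, checkpoints)

for name, sums in [("linear", linear), ("log, power=2", quadratic),
                   ("log, power=1", harmonic)]:
    print(name, [round(s, 4) for s in sums])
```

Under this reading, the first two partial sums settle toward fixed values (1/(e^0.5 - 1) and pi^2/6, respectively), while the third keeps growing by roughly ln 10 per decade. A finite partial sum can only suggest convergence; it does not prove it, so a check like this complements the analytical argument rather than replacing it.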