Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Although the Transformer architecture places no fundamental limit on the input sequence length it can process, the choice of position encoding used during training can limit performance on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We prove theoretically that FIRE can represent several popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We then show empirically that FIRE models generalize better to longer contexts on both zero-shot language modeling and long text benchmarks.