Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in a given sequence. However, both APE and RPE remain fixed after training regardless of the input data, which limits their adaptability and flexibility. We therefore argue that the desired positional encoding should be data-adaptive and dynamically adjustable with the given attention. In this paper, we propose a Data-Adaptive Positional Encoding (DAPE) method, which adjusts dynamically and semantically based on the input context and a learned fixed prior. Experiments on real-world datasets (Arxiv, Books3, and CHE) demonstrate that DAPE improves model performance at both the training length and extrapolated lengths, and the improvements are statistically significant. Model visualizations suggest that DAPE preserves both local and anti-local information. Finally, a model trained on sequences of length 128 outperforms other static positional encoding methods at evaluation length 8192, revealing the benefit of adaptive positional encoding.
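The idea of combining a learned fixed prior with the input-dependent attention can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the paper's exact architecture: the module name `DAPEBias`, the MLP width, and the way the adaptive term is added to the scores are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class DAPEBias(nn.Module):
    """Illustrative sketch of a data-adaptive positional bias.

    A small MLP, applied independently at each (query, key) position,
    combines the attention logits with a learned static bias (the
    'fixed prior') to produce a context-dependent bias. Shapes and
    hyperparameters here are assumptions, not the paper's settings.
    """

    def __init__(self, num_heads: int, hidden: int = 32):
        super().__init__()
        # Input per position: the attention logits and the static
        # prior for every head, concatenated along the channel axis.
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_heads, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, attn_logits: torch.Tensor, static_bias: torch.Tensor) -> torch.Tensor:
        # attn_logits: (batch, heads, q_len, k_len)
        # static_bias: (heads, q_len, k_len), e.g. a learned relative bias
        b = attn_logits.shape[0]
        prior = static_bias.unsqueeze(0).expand(b, -1, -1, -1)
        x = torch.cat([attn_logits, prior], dim=1)   # (b, 2*heads, q, k)
        x = x.permute(0, 2, 3, 1)                    # (b, q, k, 2*heads)
        adaptive = self.mlp(x).permute(0, 3, 1, 2)   # (b, heads, q, k)
        # Final scores: logits, static prior, and the adaptive correction.
        return attn_logits + prior + adaptive

# Example usage with toy shapes
bias = DAPEBias(num_heads=4)
scores = bias(torch.randn(2, 4, 8, 8), torch.randn(4, 8, 8))
```

Because the MLP consumes the attention logits themselves, the resulting bias varies with the input sequence, unlike a static APE or RPE table; the static prior remains available as a fallback signal the MLP can pass through.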