Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con\textbf{T}extualized equivari\textbf{A}nt \textbf{P}osition \textbf{E}ncoding (\textbf{TAPE}), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE provably enhances LLM reasoning by enabling the model to emulate a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context performance. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
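The two equivariance properties named above can be illustrated with a minimal sketch. The function below is a hypothetical content-conditioned positional update (not TAPE's actual layer): it mixes positional encodings across positions using attention weights computed from content alone. Because the mixing weights depend only on the content `X`, rotating the positional encodings `P` by an orthogonal matrix rotates the output identically, and jointly permuting the rows of `X` and `P` permutes the output rows.

```python
import numpy as np

def contextual_update(X, P):
    """Toy content-conditioned positional update (illustrative sketch only).

    Mixes positional encodings P across positions with attention
    weights computed solely from the content matrix X.
    """
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # token content for 5 positions
P = rng.normal(size=(5, 8))   # positional encodings for 5 positions

# Orthogonal equivariance: rotating P rotates the output the same way.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
assert np.allclose(contextual_update(X, P @ Q),
                   contextual_update(X, P) @ Q)

# Permutation equivariance: permuting the inputs permutes the output.
perm = rng.permutation(5)
assert np.allclose(contextual_update(X[perm], P[perm]),
                   contextual_update(X, P)[perm])
```

These constraints are what keep the positional encodings stable as they are updated layer by layer: the update can adapt to content without depending on an arbitrary choice of coordinate frame or position labeling.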