Many positional encodings (PEs) are designed to exhibit long-term decay, based on a long-standing inductive assumption: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. First, we present empirical analyses of various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs) and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without the constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These advantages enhance the model's context awareness and extrapolation, as validated by extensive experiments.
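To make the long-term decay property concrete, the following is a minimal sketch of how it arises in standard RoPE. It assumes the commonly used frequency schedule theta_j = base^(-2j/d) with base 10000 and a head dimension of 128, neither of which is stated in this abstract. The function names `rope_frequencies`, `rope_similarity`, and `window_mean` are hypothetical helpers introduced for illustration. The sketch does not implement HoPE, whose construction is not specified here.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    # One rotation frequency per 2-D pair of dimensions: theta_j = base^(-2j/d).
    # Assumption: the commonly used RoPE schedule, not taken from this abstract.
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def rope_similarity(distance, head_dim=128, base=10000.0):
    # Normalized RoPE score between an identical query and key whose 2-D pairs
    # all have unit norm, placed `distance` positions apart. Each pair is
    # rotated by distance * theta_j, so it contributes cos(distance * theta_j).
    thetas = rope_frequencies(head_dim, base)
    return sum(math.cos(distance * t) for t in thetas) / len(thetas)


def window_mean(start, width=64, head_dim=128):
    # Average the score over a window of distances to smooth out the
    # oscillation contributed by the high-frequency pairs.
    total = sum(rope_similarity(start + k, head_dim) for k in range(width))
    return total / width


if __name__ == "__main__":
    for start in (1, 64, 512, 4096):
        print(start, round(window_mean(start), 3))
```

Under these assumptions, the windowed score is largest for nearby tokens and shrinks as the relative distance grows. This is the decay behaviour that the abstract argues models do not actually follow at a global scale.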