Enabling LLMs to handle lengthy contexts is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate RoPE, trained on comparatively short texts, to far longer texts. Considerable effort has been devoted to boosting extrapolation by extending the formulation of RoPE; however, few studies have attempted to explain its inner workings comprehensively. In this paper, we offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective, evaluated on two benchmark tasks. A broad array of experiments reveals several valuable findings: 1) maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) large attention uncertainty leads to retrieval errors; 3) using longer continual-pretraining lengths for RoPE extensions reduces attention uncertainty and significantly enhances extrapolation.
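For readers unfamiliar with RoPE, the core idea is to rotate pairs of query/key channels by a position-dependent angle, so that attention scores depend only on relative positions. Below is a minimal NumPy sketch of this mechanism, assuming the common half-split channel pairing; the `rope` helper is illustrative, not the implementation used in this paper:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).

    Channel pairs (i, i + dim//2) are rotated by angle
    pos * base^(-2i/dim), following the half-split convention.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each channel pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

The key property: the dot product between a rotated query at position m and a rotated key at position n depends only on m - n, which is why stretching or rescaling these angles (as RoPE extensions do) changes attention patterns at long range.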