Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents the spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to a 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
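To make the core idea concrete, the following is a minimal sketch of soft clipping applied to RoPE rotation angles. The abstract does not specify the exact clipping function, so the tanh-based cap, the `train_len` threshold for deciding which components count as "low-frequency", and all function names here are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/d)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def cope_angles(position: int, head_dim: int,
                train_len: int = 4096, base: float = 10000.0) -> np.ndarray:
    """Rotation angles with soft-clipped low-frequency components (sketch).

    Low-frequency dimensions are those whose period 2*pi/theta_i exceeds
    the training length: beyond train_len they produce angles never seen
    during training (the OOD outliers the abstract refers to). Here a
    tanh cap (a hypothetical choice of soft-clip function) smoothly
    saturates those angles toward the largest angle observed in training,
    rather than truncating them with a hard cutoff.
    """
    freqs = rope_frequencies(head_dim, base)
    angles = position * freqs
    low = (2.0 * np.pi / freqs) > train_len   # low-frequency mask
    max_seen = train_len * freqs              # largest in-training angle
    # Soft clip: near-identity for angles well below max_seen,
    # asymptotically bounded by max_seen for far-OOD positions.
    angles[low] = max_seen[low] * np.tanh(angles[low] / max_seen[low])
    return angles
```

In this sketch, in-distribution positions are left nearly unchanged (tanh is close to the identity near zero), while positions far beyond the training length map to bounded, previously seen angle ranges; the smoothness of the cap is what distinguishes soft from hard clipping.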