Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a Goldilocks zone for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training.
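To make the oscillator formulation and the aliasing-style lower bound concrete, the following minimal NumPy sketch may help. The frequency schedule theta_i = base^(-2i/d) is the standard RoPE definition; the no-wrap condition in min_base_no_wrap (requiring the slowest oscillator to stay under one full period within the target context) is one illustrative reading of the aliasing bound, not the paper's exact derivation, and the helper names here are hypothetical.

```python
import numpy as np

def rope_frequencies(dim: int, base: float) -> np.ndarray:
    """Standard RoPE frequency schedule: theta_i = base**(-2i/dim), one per 2-D pair."""
    return base ** (-2.0 * np.arange(dim // 2) / dim)

def rope_oscillators(positions: np.ndarray, dim: int, base: float) -> np.ndarray:
    """RoPE viewed as a bank of complex oscillators: a unit phasor per position and pair."""
    phases = np.outer(positions, rope_frequencies(dim, base))  # shape (n_pos, dim/2)
    return np.exp(1j * phases)

def min_base_no_wrap(context_len: int, dim: int) -> float:
    """Smallest base keeping the slowest oscillator under one full turn in-context.

    Solves context_len * theta_min < 2*pi with theta_min = base**(-(dim-2)/dim).
    Illustrative aliasing condition only, not the paper's derived bound.
    """
    return (context_len / (2 * np.pi)) ** (dim / (dim - 2))

phasors = rope_oscillators(np.arange(4096), dim=128, base=10000.0)
print(phasors.shape)                         # (4096, 64): one phasor per position/frequency
print(f"{min_base_no_wrap(131072, 128):.3e}")  # ~2.4e4 for a 128k context, head dim 128
```

Under this toy condition, a 128-dimensional head at a 131,072-token context yields a threshold near 2.4e4; published long-context bases such as LLaMA 3's 500,000 sit well above it, consistent with the direction of the lower bound.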
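The precision wall admits an equally direct numerical probe. The sketch below assumes, purely for illustration, that per-token phases are accumulated in IEEE half precision (production kernels differ); it shows an incremental phase update rounding away entirely once the accumulated phase is large relative to the per-token step, which is the positional-erasure mechanism described above.

```python
import numpy as np

# A slow channel's per-token phase step is theta ~ 1/base; large bases make it tiny.
theta = np.float16(1e-3)
phase = np.float16(4.0)  # accumulated phase after many tokens (illustrative value)

# In float16 the spacing between representable numbers near 4.0 is 2**-8 ~ 0.0039,
# so adding 1e-3 rounds back to the same value: the position update is erased.
print(phase + theta == phase)  # True

# The same step remains distinguishable while the accumulated phase is still small:
small = np.float16(0.25)
print(small + theta == small)  # False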