Extending the context window of language models typically requires expensive long-context pre-training, which poses significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We analyze this transfer through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly transfer positional information. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to the output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
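As an illustration of the logit-based distillation signal discussed in the abstract, the following is a minimal sketch of a per-token distillation loss. The abstract does not state the exact objective, so the forward KL divergence, the temperature parameter, and the names `log_softmax` and `kd_loss` are assumptions made for this example and not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token forward KL(teacher || student).

    Both arguments are lists of per-position logit vectors, for example all
    positions of one packed training window. The choice of forward KL is an
    assumption for illustration only.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy example: two positions, vocabulary of three tokens (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, -0.5], [0.0, 0.5, 2.0]]
loss = kd_loss(teacher, student, temperature=2.0)
```

Because the loss depends on the teacher's full output distribution at every position, any positional effect on the teacher's logits is carried into the student's training signal, which is the mechanism the second finding refers to.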