Recently, large language models (LLMs) have revolutionized natural language processing (NLP). Due to their limited training context size, pretrained LLMs struggle to handle long token sequences, which limits their performance on various downstream tasks. Current approaches to long-context modeling often employ multi-stage continual pretraining, progressively increasing the effective context length over several pretraining stages. However, these approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long-context modeling capabilities while simplifying the training process. HARPE assigns different Rotary Position Encoding (RoPE) base frequency values to different attention heads and trains LLMs directly on the target context length. Extensive experiments on 4 language modeling benchmarks, including the latest RULER benchmark, demonstrate that HARPE excels at understanding and integrating long-context tasks with single-stage training, matching and even outperforming existing multi-stage methods. Our results highlight that HARPE successfully breaks the stage barrier for training LLMs with long-context modeling capabilities.
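To make the core mechanism concrete, below is a minimal PyTorch sketch of how per-head RoPE base frequencies could be precomputed and applied to queries and keys. The geometric spacing of bases between base_min and base_max, the function names, and the interleaved rotation convention are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import math
import torch

def build_per_head_rope_cache(num_heads, head_dim, max_seq_len,
                              base_min=10_000.0, base_max=1_000_000.0):
    # One RoPE base per head, spaced geometrically between base_min and
    # base_max (the schedule and range here are illustrative assumptions).
    bases = torch.logspace(math.log10(base_min), math.log10(base_max), num_heads)
    positions = torch.arange(max_seq_len, dtype=torch.float32)
    cos_list, sin_list = [], []
    for base in bases:
        # Standard RoPE inverse frequencies, computed from this head's base.
        inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
        angles = torch.outer(positions, inv_freq)  # (max_seq_len, head_dim // 2)
        cos_list.append(angles.cos())
        sin_list.append(angles.sin())
    # Both stacks have shape (num_heads, max_seq_len, head_dim // 2).
    return torch.stack(cos_list), torch.stack(sin_list)

def apply_per_head_rope(x, cos, sin):
    # x: query or key tensor of shape (batch, num_heads, seq_len, head_dim),
    # rotated pairwise in the interleaved RoPE convention.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos.unsqueeze(0), sin.unsqueeze(0)  # broadcast over the batch dim
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Example: 8 heads, head_dim 64, sequences up to 4096 tokens.
cos, sin = build_per_head_rope_cache(num_heads=8, head_dim=64, max_seq_len=4096)
q = torch.randn(2, 8, 4096, 64)
q_rot = apply_per_head_rope(q, cos, sin)
```

Under a schedule like this, each head rotates positions at a different rate, so heads with smaller bases stay sensitive to nearby positions while heads with larger bases can resolve much longer ranges.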