Recently, large language models (LLMs) have revolutionized natural language processing (NLP). Due to their limited training context size, pretrained LLMs struggle to handle long token sequences, which limits their performance on various downstream tasks. Current approaches to long-context modeling often employ multi-stage continual pretraining, progressively increasing the effective context length over several pretraining stages. However, these approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long-context modeling capabilities while simplifying the training process. HARPE assigns different Rotary Position Encoding (RoPE) base frequency values to different attention heads and trains LLMs directly on the target context length. Extensive experiments on 4 language modeling benchmarks, including the latest RULER benchmark, demonstrate that HARPE excels at understanding and integrating long-context tasks with single-stage training, matching and even outperforming existing multi-stage methods. Our results highlight that HARPE successfully breaks the stage barrier for training LLMs with long-context modeling capabilities.
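To make the core mechanism concrete, below is a minimal PyTorch sketch of how per-head RoPE base frequencies could be precomputed and applied to queries and keys. The geometric spacing of bases between base_min and base_max, the function names, and the interleaved rotation convention are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import math
import torch

def build_per_head_rope_cache(num_heads, head_dim, max_seq_len,
                              base_min=10_000.0, base_max=1_000_000.0):
    # One RoPE base per head, spaced geometrically between base_min and
    # base_max (the schedule and range here are illustrative assumptions).
    bases = torch.logspace(math.log10(base_min), math.log10(base_max), num_heads)
    positions = torch.arange(max_seq_len, dtype=torch.float32)
    cos_list, sin_list = [], []
    for base in bases:
        # Standard RoPE inverse frequencies, computed from this head's base.
        inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
        angles = torch.outer(positions, inv_freq)  # (max_seq_len, head_dim // 2)
        cos_list.append(angles.cos())
        sin_list.append(angles.sin())
    # Both stacks have shape (num_heads, max_seq_len, head_dim // 2).
    return torch.stack(cos_list), torch.stack(sin_list)

def apply_per_head_rope(x, cos, sin):
    # x: query or key tensor of shape (batch, num_heads, seq_len, head_dim),
    # rotated pairwise in the interleaved RoPE convention.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos.unsqueeze(0), sin.unsqueeze(0)  # broadcast over the batch dim
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Example: 8 heads, head_dim 64, sequences up to 4096 tokens.
cos, sin = build_per_head_rope_cache(num_heads=8, head_dim=64, max_seq_len=4096)
q = torch.randn(2, 8, 4096, 64)
q_rot = apply_per_head_rope(q, cos, sin)
```

Under a schedule like this, each head rotates positions at a different rate, so heads with smaller bases stay sensitive to nearby positions while heads with larger bases can resolve much longer ranges.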