It is well known that LLMs struggle to generalize to contexts longer than their training sequence length. This poses challenges when employing LLMs to process long input sequences at inference time. In this work, we argue that LLMs have an inherent capability to handle long contexts without fine-tuning. To unlock this capability, we propose SelfExtend, which extends the context window of LLMs by constructing bi-level attention information: grouped attention and neighbor attention. Grouped attention captures dependencies among tokens that are far apart, while neighbor attention captures dependencies among adjacent tokens within a specified range. Both levels of attention are computed with the original model's self-attention mechanism during inference. With a minor code modification, SelfExtend can effortlessly extend an existing LLM's context window without any fine-tuning. We conduct comprehensive experiments on multiple benchmarks, and the results show that SelfExtend effectively extends the context window length of existing LLMs. The code can be found at \url{https://github.com/datamllab/LongLM}.
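The bi-level idea can be illustrated as a remapping of relative positions: tokens within a neighbor window keep their exact relative positions, while more distant tokens share coarser, grouped positions obtained by floor division. The sketch below is a minimal illustration of this remapping only, not the repository's actual implementation; the function name and the use of NumPy are our own assumptions.

```python
import numpy as np

def self_extend_rel_pos(seq_len: int, group_size: int, neighbor_window: int) -> np.ndarray:
    """Illustrative SelfExtend-style relative-position map (not the official code).

    Positions within `neighbor_window` are kept exact (neighbor attention);
    farther positions are floor-divided by `group_size` (grouped attention)
    and shifted so the two ranges join continuously at the window boundary.
    """
    q = np.arange(seq_len)[:, None]   # query indices
    k = np.arange(seq_len)[None, :]   # key indices
    d = q - k                         # standard relative positions (causal part is d >= 0)

    # Grouped positions, shifted to line up with exact positions at d == neighbor_window.
    grouped = d // group_size + (neighbor_window - neighbor_window // group_size)

    # Exact positions inside the neighbor window, grouped positions outside.
    # (The upper triangle, d < 0, is masked out by causal attention anyway.)
    return np.where(d < neighbor_window, d, grouped)
```

Because the grouped positions are compressed by `group_size`, the largest relative position ever seen stays within the range the model was trained on, even for sequences longer than the training length. For example, with `group_size=2` and `neighbor_window=4`, a distance of 9 maps to position 6 instead of 9.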