LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

In recent years, Transformer-based Large Language Models (LLMs) have made remarkable advances across various domains. As these LLMs are deployed on increasingly complex tasks, they often need to follow longer user prompts or generate longer texts. In these situations, the $\textit{length generalization failure}$ of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length, and LLMs often struggle to generate fluent and coherent text after longer contexts, even with relative positional encoding specifically designed to cope with this problem. Common solutions such as finetuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To more efficiently extrapolate existing LLMs' generation quality to longer texts, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose LM-Infinite, a simple yet effective solution for on-the-fly length generalization. It involves only a $\mathbf{\Lambda}$-shaped attention mask (to avoid attending to excessively many tokens) and a distance limit (to avoid unseen distances), and requires no parameter updates or learning. We find it applicable to a variety of LLMs that use relative-position encoding methods. LM-Infinite is computationally efficient, with $O(n)$ time and space, and maintains consistent text generation fluency and quality on sequences of up to 128k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. We will make the code publicly available following publication.
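
To make the two ingredients concrete, the following is a minimal sketch, not the released implementation. It assumes the two arms of the $\mathbf{\Lambda}$ are a handful of leading tokens plus a sliding window of the most recent tokens, and that the distance limit simply caps the relative distance passed to the positional encoding. The names `lambda_mask`, `clipped_distance`, `n_global`, `n_local`, and `d_max` are hypothetical and chosen only for illustration.

```python
from typing import List


def lambda_mask(seq_len: int, n_global: int, n_local: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Assumption: the Lambda shape is the union of the first `n_global`
    keys (left arm) and the `n_local` most recent keys (right arm).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i              # never attend to future tokens
            in_global = j < n_global     # left arm: leading tokens
            in_local = i - j < n_local   # right arm: recent tokens
            row.append(causal and (in_global or in_local))
        mask.append(row)
    return mask


def clipped_distance(i: int, j: int, d_max: int) -> int:
    """Relative distance seen by the positional encoding, capped at d_max."""
    return min(i - j, d_max)


if __name__ == "__main__":
    # Print the mask so the Lambda shape is visible ("x" = attended).
    for row in lambda_mask(seq_len=8, n_global=2, n_local=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions each query attends to at most `n_global + n_local` keys, which is where a linear total cost would come from. The dense mask built here is quadratic in size and serves only to visualize the pattern.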
