LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

Transformer-based Large Language Models (LLMs) have advanced remarkably across many domains in recent years. As these LLMs are deployed on increasingly complex tasks, they often need to follow longer user prompts or generate longer texts. In these situations, the $\textit{length generalization failure}$ of LLMs on long sequences becomes more prominent. Because most pre-training schemes truncate training sequences to a fixed length, LLMs often struggle to generate fluent and coherent text after longer contexts, even with relative positional encodings specifically designed to cope with this problem. Common solutions such as finetuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To more efficiently extrapolate existing LLMs' generation quality to longer texts, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose a simple yet effective solution for on-the-fly length generalization, LM-Infinite. It involves only a $\mathbf{\Lambda}$-shaped attention mask (to avoid attending to excessively many tokens) and a distance limit (to avoid unseen distances), and requires no parameter updates or learning. We find it applicable to a variety of LLMs that use relative-position encoding methods. LM-Infinite is computationally efficient, with $O(n)$ time and space complexity, and maintains consistent text generation fluency and quality on sequences of up to 128k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. We will make the code publicly available following publication.
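
To make the two ingredients named above concrete, the following is a minimal NumPy sketch of one plausible reading of a $\mathbf{\Lambda}$-shaped attention mask and a distance limit. It is an illustration under stated assumptions, not the authors' implementation: the function names and the parameters `n_global`, `n_local`, and `d_max` are hypothetical and do not come from the text, and the dense $n \times n$ arrays are built only for readability (they do not reflect the $O(n)$ cost stated above).

```python
import numpy as np


def lambda_mask(n, n_global, n_local):
    """Boolean (n, n) mask; entry [i, j] is True if query i may attend to key j.

    Hypothetical reading of a Lambda-shaped mask: each query sees the first
    n_global tokens plus its n_local most recent tokens, and nothing else.
    """
    i = np.arange(n)[:, None]      # query positions (rows)
    j = np.arange(n)[None, :]      # key positions (columns)
    causal = j <= i                # never attend to future tokens
    starting = j < n_global        # left arm of the Lambda: starting tokens
    recent = (i - j) < n_local     # right arm of the Lambda: recent tokens
    return causal & (starting | recent)


def bounded_distance(n, d_max):
    """Relative distances i - j, capped at d_max (assumed form of the distance limit)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.minimum(np.maximum(i - j, 0), d_max)


# Toy sizes, chosen only for illustration.
mask = lambda_mask(8, n_global=2, n_local=3)
dist = bounded_distance(8, d_max=3)
print(mask.astype(int))
print(dist)
```

In this sketch every query attends to at most `n_global + n_local` keys regardless of sequence length, which is the property that bounds the number of attended tokens, and no relative distance passed to the positional encoding exceeds `d_max`.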
