LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

Transformer-based Large Language Models (LLMs) have advanced remarkably across many domains in recent years. As these LLMs are deployed on increasingly complex tasks, they often need to carry out longer reasoning processes or understand larger contexts. In these situations, the length generalization failure of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length. LLMs often struggle to generate fluent and coherent text on longer contexts, let alone carry out downstream tasks, even when they use relative positional encodings designed to cope with this problem. Common solutions such as fine-tuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To leverage the generation capacity of existing LLMs more efficiently, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose LM-Infinite, a simple yet effective solution for on-the-fly length generalization. It involves only a $\Lambda$-shaped attention mask (to avoid attending to an excessive number of tokens) and a distance limit (to avoid unseen distances), and requires no parameter updates or learning. We find it applicable to a variety of LLMs that use relative position encoding methods. LM-Infinite is computationally efficient, with $O(n)$ time and space, and maintains consistent text generation fluency and quality on sequences as long as 32k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than the training length, where vanilla models fail immediately.
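The two ingredients named above, a $\Lambda$-shaped attention mask and a distance limit, can be illustrated with a minimal sketch. This is one plausible reading rather than the authors' implementation: it assumes the mask lets each query attend to the first few tokens plus a window of the most recent tokens, and that the distance limit caps the relative distance fed to the positional encoding. The names `n_global`, `n_local`, and `d_max`, and both function names, are hypothetical and chosen for illustration only.

```python
def lambda_mask(seq_len, n_global, n_local):
    """Build a causal, Lambda-shaped boolean attention mask.

    mask[i][j] is True when query position i may attend to key position j.
    Assumption: a query sees the first `n_global` tokens (one arm of the
    Lambda) and the `n_local` most recent tokens (the other arm).
    """
    return [
        [
            j <= i and (j < n_global or i - j < n_local)
            for j in range(seq_len)
        ]
        for i in range(seq_len)
    ]


def clipped_distance(i, j, d_max):
    """Relative distance between query i and key j (j <= i), capped at d_max.

    Assumption: capping keeps the positional encoding from ever seeing a
    distance larger than `d_max`, e.g. larger than those seen in training.
    """
    return min(i - j, d_max)


if __name__ == "__main__":
    # Print the mask for a tiny sequence: 'x' = attended, '.' = masked out.
    for row in lambda_mask(seq_len=8, n_global=2, n_local=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions each query attends to at most `n_global + n_local` keys regardless of sequence length, which is consistent with the $O(n)$ time and space stated above.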
