LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

In recent years, Transformer-based Large Language Models (LLMs) have made remarkable advances across various domains. As these LLMs are deployed for increasingly complex tasks, they often need to carry out longer reasoning processes or understand larger contexts. In these situations, the length generalization failure of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length (such as 2048 for LLaMa). LLMs often struggle to generate fluent text, let alone carry out downstream tasks, after longer contexts, even with relative positional encoding, which is designed to cope with this problem. Common solutions such as finetuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To more efficiently leverage the generation capacity of existing LLMs, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose a simple yet effective solution for on-the-fly length generalization, LM-Infinite, which involves only a $\Lambda$-shaped attention mask and a distance limit while requiring no parameter updates or learning. We find it applicable to a variety of LLMs using relative-position encoding methods. LM-Infinite is computationally efficient with $O(n)$ time and space, and demonstrates consistent fluency and generation quality on sequences as long as 32k tokens on ArXiv and OpenWebText2 datasets, with 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than training lengths where vanilla models fail immediately.
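To make the two ingredients named above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the $\Lambda$-shaped mask lets each token attend to a few leading tokens plus a window of recent tokens, and that the distance limit caps the relative distance fed to the positional encoding. The function names and the parameters `n_global`, `n_local`, and `d_max` are hypothetical and do not appear in the abstract.

```python
import numpy as np


def lambda_mask(seq_len, n_global, n_local):
    """Return a boolean mask where mask[i, j] is True if query i may attend to key j.

    Assumption: the Lambda shape is the union of a "global" branch (the first
    n_global keys) and a "local" branch (the n_local most recent keys),
    restricted to causal positions (j <= i).
    """
    i = np.arange(seq_len)[:, None]   # query positions, as a column
    j = np.arange(seq_len)[None, :]   # key positions, as a row
    causal = j <= i
    global_branch = j < n_global
    local_branch = (i - j) < n_local
    return causal & (global_branch | local_branch)


def clipped_distance(seq_len, d_max):
    """Return relative distances i - j, clipped to the range [0, d_max].

    Assumption: the distance limit is applied by capping the distance that a
    relative positional encoding sees, so it never exceeds d_max.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j, 0, d_max)


if __name__ == "__main__":
    mask = lambda_mask(seq_len=8, n_global=2, n_local=3)
    print(mask.astype(int))
    print(clipped_distance(seq_len=8, d_max=4))
```

Under these assumptions, each query attends to at most `n_global + n_local` keys regardless of sequence length, which is consistent with the $O(n)$ cost stated above. The dense `seq_len x seq_len` arrays here are for illustration only; an efficient implementation would avoid materializing them.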
