LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

In recent years, Transformer-based Large Language Models (LLMs) have made remarkable performance gains across various domains. As these LLMs are deployed for increasingly complex tasks, they often need to carry out longer reasoning processes or understand larger contexts. In these situations, the length generalization failure of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length. Beyond that length, LLMs often struggle to generate fluent and coherent text, let alone carry out downstream tasks, even when they use relative positional encoding designed to cope with this problem. Common solutions such as finetuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To leverage the generation capacity of existing LLMs more efficiently, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose LM-Infinite, a simple yet effective solution for on-the-fly length generalization. It involves only a $\Lambda$-shaped attention mask (to avoid attending to too many tokens) and a distance limit (to avoid distances unseen during training), and requires no parameter updates or learning. We find it applicable to a variety of LLMs that use relative-position encoding methods. LM-Infinite is computationally efficient, with $O(n)$ time and space, and maintains consistent text generation fluency and quality on sequences as long as 32k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than training lengths, where vanilla models fail immediately.
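The two ingredients named above, a $\Lambda$-shaped attention mask and a distance limit, can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation. It assumes the two arms of the $\Lambda$ are a block of leading tokens and a window of the most recent tokens, and that the distance limit caps the relative distance passed to the positional encoding. The names `n_global`, `n_local`, and `max_distance` are hypothetical parameters introduced here for illustration.

```python
def lambda_mask(seq_len, n_global, n_local):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumption: the Lambda shape is the union of the first `n_global`
    tokens and the `n_local` most recent tokens, under a causal constraint.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                # never attend to future tokens
            in_global = j < n_global       # leading tokens (one arm)
            in_local = i - j < n_local     # recent tokens (other arm)
            row.append(causal and (in_global or in_local))
        mask.append(row)
    return mask


def limited_distance(i, j, max_distance):
    """Relative distance between query i and key j, capped at max_distance.

    Assumption: the cap keeps distances within a range seen in training.
    """
    return min(i - j, max_distance)


def attended_per_query(mask):
    """Count how many keys each query attends to."""
    return [sum(row) for row in mask]


if __name__ == "__main__":
    mask = lambda_mask(seq_len=8, n_global=2, n_local=3)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    # Each query attends to at most n_global + n_local keys.
    print(attended_per_query(mask))  # [1, 2, 3, 4, 5, 5, 5, 5]
```

In this sketch each query attends to at most `n_global + n_local` keys regardless of sequence length, which is consistent with the $O(n)$ time and space stated above.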
