Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process longer sequences due to out-of-domain and distraction issues. Common solutions often involve continual pre-training on longer sequences, which introduces expensive computational overhead and uncontrollable changes in model capabilities. In this paper, we unveil the intrinsic capacity of LLMs to understand extremely long sequences without any fine-tuning. To this end, we introduce a training-free memory-based method, InfLLM. Specifically, InfLLM stores distant contexts in additional memory units and employs an efficient mechanism to look up token-relevant units for attention computation. Thereby, InfLLM allows LLMs to efficiently process long sequences with a limited context window and to effectively capture long-distance dependencies. Without any training, InfLLM enables LLMs that are pre-trained on sequences of only a few thousand tokens to achieve performance comparable to competitive baselines that continually train these LLMs on long sequences. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies. Our code can be found at \url{https://github.com/thunlp/InfLLM}.
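The block-memory lookup described above can be sketched in a few lines. The following is a minimal single-query NumPy illustration, not the paper's actual implementation: the distant key-value cache is partitioned into fixed-size memory units, each unit is summarized by a representative key (the mean key here is an assumption for illustration), and only the top-$k$ query-relevant units plus a recent local window enter the attention computation. Function and parameter names (`memory_attention`, `block_size`, `top_k`, `local`) are hypothetical, not from the InfLLM codebase.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def memory_attention(q, K, V, block_size=4, top_k=2, local=8):
    """Attention over a bounded window: recent tokens plus top-k memory units.

    Hypothetical sketch of the memory-lookup idea; mean-pooled block keys
    serve as unit representatives (an illustrative assumption).
    """
    n = K.shape[0]
    if n <= local:
        return attend(q, K, V)
    # Split the cache into a distant part (stored as memory) and a local window.
    Kd, Vd = K[:n - local], V[:n - local]
    Kl, Vl = K[n - local:], V[n - local:]
    # Partition the distant context into fixed-size memory units.
    m = Kd.shape[0] // block_size
    Kd, Vd = Kd[:m * block_size], Vd[:m * block_size]  # drop ragged tail for simplicity
    blocks_K = Kd.reshape(m, block_size, -1)
    blocks_V = Vd.reshape(m, block_size, -1)
    # One representative key per unit; score units against the current query.
    reps = blocks_K.mean(axis=1)
    idx = np.argsort(reps @ q)[-top_k:]  # indices of the top-k relevant units
    sel_K = blocks_K[idx].reshape(-1, K.shape[1])
    sel_V = blocks_V[idx].reshape(-1, V.shape[1])
    # Attend only over the selected units and the local window.
    return attend(q, np.vstack([sel_K, Kl]), np.vstack([sel_V, Vl]))
```

Because attention is permutation-invariant over key-value pairs, selecting every unit (`top_k = m`) recovers full attention exactly; shrinking `top_k` bounds the attended context regardless of total sequence length, which is the source of the efficiency gain.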