The maximum supported context length is a critical bottleneck limiting the practical application of Large Language Models (LLMs). Although existing length extrapolation methods can extend the context of LLMs to millions of tokens, they all have an explicit upper bound. In this work, we propose LongCache, a training-free approach that enables LLMs to support infinite context within a finite context scope through full-context cache selection and training-free integration. This effectively frees LLMs from the length extrapolation issue. We validate LongCache on LongBench and L-Eval and demonstrate that its performance is on par with traditional full-attention mechanisms. Furthermore, we have applied LongCache to mainstream LLMs, including LLaMA3 and Mistral-v0.3, enabling them to support context lengths of at least 400K tokens in the Needle-In-A-Haystack test. We will soon improve the efficiency of LongCache through GPU-aware optimization.
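To illustrate the general idea of attending over a finite scope selected from a full-context cache, here is a minimal sketch of KV-cache entry selection. It keeps a recent local window and retrieves the most query-relevant older entries; all names, parameters, and the scoring rule are illustrative assumptions, not LongCache's actual method.

```python
import numpy as np

def select_kv_cache(query, keys, values, local_window=8, top_k=4):
    """Illustrative sketch (not LongCache's API): always keep the most
    recent `local_window` cache entries, and from the older entries pick
    the `top_k` whose keys score highest against the current query, so
    attention runs over a bounded set regardless of total cache length."""
    n = keys.shape[0]
    recent = np.arange(max(0, n - local_window), n)   # local window indices
    older = np.arange(0, max(0, n - local_window))    # candidates for retrieval
    if older.size:
        scores = keys[older] @ query                  # relevance of old entries
        picked = older[np.argsort(scores)[-top_k:]]   # top-k by score
        idx = np.sort(np.concatenate([picked, recent]))
    else:
        idx = recent
    return keys[idx], values[idx], idx
```

With a cache of 32 entries, `local_window=8`, and `top_k=4`, attention is computed over only 12 entries, so the effective scope stays constant as the cache grows.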