Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, owing to the quadratic complexity of self-attention. In this work, we identify a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared with semantically meaningful tokens. This observation suggests that the information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. We also implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, with the Llama-3-8B backbone, SepLLM achieves over a 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of over 4 million tokens while maintaining consistent language modeling capability.
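The core idea of keeping separator tokens (together with initial and local tokens) while dropping the rest of the KV cache can be illustrated with a small sketch. The snippet below is a minimal, illustrative construction of such a sparse causal attention mask; the function name, separator IDs, and parameters (`n_init` initial "sink" tokens, a local `window`) are our own assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

def sepllm_mask(tokens, sep_ids, n_init=4, window=8):
    """Boolean (n, n) causal mask: mask[q, k] = True if query q may attend to key k.

    A query attends only to (1) the first n_init tokens, (2) separator
    tokens, which condense the segments preceding them, and (3) tokens
    within a local window of recent positions.
    """
    n = len(tokens)
    is_sep = np.isin(tokens, list(sep_ids))
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):            # causal: keys at or before the query
            if k < n_init:                # initial (attention-sink) tokens
                mask[q, k] = True
            elif is_sep[k]:               # separator tokens kept indefinitely
                mask[q, k] = True
            elif q - k < window:          # recent local tokens
                mask[q, k] = True
    return mask
```

Distant non-separator tokens become invisible to later queries, which is what permits evicting their KV entries and yields the cache reduction described above.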