The increasing context window size of Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly in memory usage and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens rather than discarding them. We address two challenges: 1) investigating the distribution of important tokens in the context, finding that recent tokens are more important than distant ones, and 2) optimizing resources for distant tokens by sharing attention scores across layers. Experiments show that our method saves $35\%$ of the KV cache without compromising performance.
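The core idea of sharing attention scores across layers can be illustrated with a minimal sketch. All sizes (`d`, `n`, `window`) and the exact caching scheme here are illustrative assumptions, not the paper's implementation: one layer computes full scaled dot-product logits and caches those of the distant tokens; the next layer recomputes logits only for the recent window and reuses the cached distant logits before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, window = 16, 32, 8  # head dim, context length, recent window (assumed sizes)

def scores(q, K):
    """Scaled dot-product logits of one query against a key matrix."""
    return q @ K.T / np.sqrt(d)

# Layer l: full attention; cache the logits of the distant (non-recent) tokens.
q_l = rng.normal(size=d)
K = rng.normal(size=(n, d))
logits_l = scores(q_l, K)
distant_cache = logits_l[: n - window]  # shared with the next layer

# Layer l+1: compute fresh logits only for the recent window,
# reuse the cached distant logits instead of re-attending to distant keys.
q_next = rng.normal(size=d)
logits_next = np.concatenate([distant_cache, scores(q_next, K[n - window:])])

# Softmax over the combined (shared + fresh) logits.
attn = np.exp(logits_next - logits_next.max())
attn /= attn.sum()
```

In this sketch the next layer touches only `window` of the `n` keys, which is where the memory and compute savings for distant tokens would come from.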