Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant-sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout decoding shows merit on a variety of tasks, where we demonstrate that LESS can help narrow the performance gap relative to caching everything, sometimes even matching it, all while remaining efficient.
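To make the idea concrete, below is a minimal sketch of how an eviction-based KV cache could be paired with a constant-sized auxiliary state so that evicted tokens remain queryable. The class name, the feature map phi, the oldest-first eviction rule, and the 50/50 blending of the two attention reads are illustrative assumptions for this sketch, not the paper's actual formulation or training procedure.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def phi(x):
    # Hypothetical nonnegative feature map for the constant-size state;
    # purely illustrative, not the paper's learned kernel.
    return np.maximum(x, 0.0) + 1e-6

class LessStyleCache:
    """Sketch: an eviction-based KV cache augmented with a constant-size
    accumulator that absorbs evicted pairs so no token is fully discarded."""
    def __init__(self, d, budget):
        self.budget = budget          # max number of exact KV pairs retained
        self.keys, self.values = [], []
        self.H = np.zeros((d, d))     # constant-size accumulator for evicted pairs
        self.z = np.zeros(d)          # running normalizer for the accumulator

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.budget:
            # Evict a pair (a real policy would score importance; here: oldest),
            # but fold it into the constant-size state instead of dropping it.
            k_old, v_old = self.keys.pop(0), self.values.pop(0)
            f = phi(k_old)
            self.H += np.outer(f, v_old)
            self.z += f

    def attend(self, q):
        # Exact softmax attention over the retained KV pairs ...
        K, V = np.stack(self.keys), np.stack(self.values)
        w = softmax(K @ q / np.sqrt(q.shape[0]))
        exact = w @ V
        # ... plus a linear-attention-style read of the evicted-token state,
        # so earlier tokens can still influence the output.
        f = phi(q)
        recalled = (f @ self.H) / (f @ self.z + 1e-6)
        return 0.5 * exact + 0.5 * recalled  # naive fixed mix, for illustration only
```

Under these assumptions, the auxiliary state (H, z) occupies O(d^2) memory regardless of sequence length, which is why the abstract describes the added cache as constant-sized and nearly free relative to storing every KV pair.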