Despite the recent success of Large Language Models~(LLMs), they remain cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies that keep the key-value cache overhead under a given budget. This paper examines the efficacy of existing eviction policies along two axes: \textit{importance score calculation} and \textit{eviction scope construction}. We identify the deficiencies of prior policies in these two aspects and introduce RoCo, a \underline{r}\underline{o}bust \underline{c}ache \underline{o}mission policy based on temporal attention scores and robustness measures. Extensive experiments spanning both the prefilling and auto-regressive decoding stages validate the superiority of RoCo. Finally, we release EasyKV, a versatile software package for user-friendly key-value-constrained generative inference. Code is available at \url{https://github.com/DRSY/EasyKV}.