Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by lengthy input and output sequences, notably contributes to the high inference cost. Motivated by this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next token prediction. This results in significant contextual information loss for certain layers, leading to a notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers as the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with prior approaches. It exhibits a superior trade-off between inference efficiency and generation quality, showing promising potential for practical applications. Code is available at \url{https://github.com/THU-MIG/PrefixKV}.
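To make the binary-search idea concrete, below is a minimal sketch of how a global importance threshold could be searched so that each layer retains a prefix of its most important KV entries while a total cache budget is met. The function name, the 50-iteration search, and the use of precomputed per-layer importance scores (sorted descending) are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def adaptive_layerwise_retention(importance, budget):
    """Binary-search a global importance threshold so that the total number
    of retained KV entries across all layers fits within `budget`.

    importance: list of 1-D arrays, one per layer, each sorted in descending
                order (higher score = more important KV position).
    Returns a list of per-layer retention sizes (prefix lengths).
    Note: this is an illustrative sketch, not the official PrefixKV code.
    """
    lo = min(float(arr.min()) for arr in importance)
    hi = max(float(arr.max()) for arr in importance)
    for _ in range(50):  # binary search over the threshold value
        mid = (lo + hi) / 2.0
        # each layer keeps the prefix of entries with importance >= mid;
        # negating turns the descending array into an ascending one for searchsorted
        sizes = [int(np.searchsorted(-arr, -mid, side="right")) for arr in importance]
        if sum(sizes) > budget:
            lo = mid  # keeping too many entries -> raise the threshold
        else:
            hi = mid  # within budget -> try a lower threshold
    return [int(np.searchsorted(-arr, -hi, side="right")) for arr in importance]
```

Because a single threshold is shared across layers, layers whose KV vectors carry more high-importance entries automatically receive larger caches, while less critical layers shrink, which is the intuition behind adaptive layer-wise retention.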