With the widespread adoption of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received considerable attention. Among the many methods proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the rest, are among the most popular. These approaches typically rely on a pre-defined set of layers at which tokens are selected. Such a design is inflexible: accuracy varies significantly across tasks and deteriorates on harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction by exploiting the variance of token ranks ordered by attention score. The proposed method balances performance across different tasks while meeting the user-specified KV budget. ASL operates during the prefilling stage and can be used jointly with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. Through evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that ASL, equipped with one-shot token selection, adaptively trades inference speed for accuracy, outperforming state-of-the-art layer-wise token pruning methods on difficult tasks.
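To make the idea of choosing a selection layer from the variance of attention-ordered token ranks more concrete, here is a toy sketch in Python. The abstract does not specify ASL's actual criterion, so everything below is an assumption made for illustration: the per-layer score matrix `scores`, the `window` and `threshold` parameters, the stopping rule (select at the first layer where the ranks of the current top tokens have stopped changing), and the function names are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def token_ranks(scores):
    """Rank tokens within each layer; rank 0 is the highest attention score.

    scores: array of shape (num_layers, num_tokens), a hypothetical
    per-layer aggregate attention score for each prompt token.
    """
    order = np.argsort(-scores, axis=-1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(scores.shape[1])[None, :]
    return ranks


def choose_selection_layer(scores, budget, window=2, threshold=1.0):
    """Return the first layer whose top-`budget` tokens have stable ranks.

    Assumed rule (not taken from the paper): compute the variance of each
    top token's rank over the last `window` layers, and stop at the first
    layer where the mean variance is at most `threshold`.
    """
    ranks = token_ranks(scores)
    num_layers = scores.shape[0]
    for layer in range(window - 1, num_layers):
        top = np.argsort(-scores[layer], kind="stable")[:budget]
        recent = ranks[layer - window + 1 : layer + 1, top]
        if recent.var(axis=0).mean() <= threshold:
            return layer
    # Ranks never stabilized: fall back to the last layer.
    return num_layers - 1


def select_tokens(scores, budget, **kwargs):
    """One-shot selection: keep `budget` tokens at the chosen layer."""
    layer = choose_selection_layer(scores, budget, **kwargs)
    kept = np.sort(np.argsort(-scores[layer], kind="stable")[:budget])
    return layer, kept


# Toy input: the ranking flips between layers 0 and 1, then stays fixed.
scores = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
])
layer, kept = select_tokens(scores, budget=2, window=2, threshold=0.5)
```

Under this assumed rule, an input whose ranking settles early is pruned at a shallow layer, which saves more prefill computation, while an input whose ranking keeps shifting is pruned later, spending inference speed to preserve accuracy. Here `budget` stands in for the user-specified KV budget, expressed as a number of retained tokens.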