To reduce memory consumption during LLM inference, prior works have proposed numerous KV cache pruning methods based on various criteria. While these techniques often achieve lossless memory reduction on many datasets, they rely on an under-emphasized condition: a dataset- or domain-specific budget size threshold must be pre-determined to reach optimal performance. However, such input-specific tuning is difficult in real-world scenarios, as open-domain inputs span diverse domains, lengths, and difficulty levels without clear boundaries for pre-tuning. The dependence on an input-sensitive threshold is thus an inherent limitation that can cause substantial degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraint for robust KV pruning, calling for "threshold-free" methods that automatically adjust budget sizes while preserving full-cache performance. We then propose ReFreeKV, a novel method and the first solution fulfilling this objective, validated by extensive experiments on 13 datasets spanning diverse context lengths and task types, across multiple model sizes.
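To make the "budget size threshold" concrete, the sketch below illustrates the generic score-based eviction pattern the abstract critiques: a fixed pre-set budget of KV entries is kept per head, versus a toy input-adaptive variant whose retained size varies with the input. This is a minimal illustration under assumed tensor shapes and scoring; the function names, the attention-mass criterion, and the `mass` parameter are illustrative assumptions and do not describe the ReFreeKV algorithm itself.

```python
import torch

def prune_kv_fixed_budget(keys, values, attn_scores, budget):
    """Generic fixed-budget eviction (illustrative, not ReFreeKV):
    keep the `budget` cached tokens with the highest accumulated
    attention scores. keys/values: [heads, seq_len, head_dim],
    attn_scores: [seq_len]."""
    budget = min(budget, attn_scores.shape[0])
    keep = torch.topk(attn_scores, k=budget).indices.sort().values
    return keys[:, keep, :], values[:, keep, :]

def prune_kv_adaptive(keys, values, attn_scores, mass=0.99):
    """Toy input-adaptive variant (an assumption for illustration):
    keep the smallest set of tokens whose normalized scores cover
    `mass` of the total attention, so the retained size grows or
    shrinks with the input instead of being pre-tuned per dataset."""
    probs = attn_scores / attn_scores.sum()
    order = torch.argsort(probs, descending=True)
    cum = torch.cumsum(probs[order], dim=0)
    k = int(torch.searchsorted(cum, torch.tensor(mass)).item()) + 1
    keep = order[:k].sort().values
    return keys[:, keep, :], values[:, keep, :]
```

In the fixed-budget version, the same `budget` is applied to every input, which is exactly the dataset-specific hyperparameter the proposed objective seeks to eliminate; the adaptive variant only gestures at how a per-input criterion could replace it.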