Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
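To make the core quantities concrete, the sketch below shows how an SVD-based compressibility measure of this kind can be computed for a single cache matrix. It is an illustrative implementation under stated assumptions, not the paper's code: it uses the standard effective-rank definition of Roy and Vetterli (entropy of the normalized singular-value spectrum, exponentiated) and normalizes by the maximum possible rank; KV-CoRE's exact normalization and its incremental, dataset-level accumulation may differ.

```python
import numpy as np

def normalized_effective_rank(X: np.ndarray) -> float:
    """Normalized effective rank of a matrix X.

    The SVD yields the optimal rank-k approximation under the
    Frobenius norm (Eckart-Young theorem), so the decay of the
    singular-value spectrum directly reflects low-rank
    compressibility. Definition assumed here (may differ from the
    paper): effective rank = exp(entropy of the normalized
    spectrum), divided by min(m, n).
    """
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    p = s / s.sum()                          # spectrum as a distribution
    p = p[p > 0]                             # avoid log(0)
    entropy = -(p * np.log(p)).sum()
    erank = np.exp(entropy)                  # effective rank
    return float(erank / min(X.shape))       # in (0, 1]

# A nearly rank-1 "cache" scores low; an unstructured one scores high.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((64, 1)) @ rng.standard_normal((1, 128)) \
    + 1e-3 * rng.standard_normal((64, 128))
full_rank = rng.standard_normal((64, 128))
print(normalized_effective_rank(low_rank))   # small: highly compressible
print(normalized_effective_rank(full_rank))  # large: little low-rank structure
```

In a layer-wise analysis, the same function would be applied to the stacked key (or value) vectors of each layer, giving one compressibility score per layer per dataset.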