Large language models (LLMs) outperform earlier architectures on generative and long-context tasks, but their scale introduces significant challenges in memory usage, energy cost, and on-device deployment. Because scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, models and their context lengths keep growing, and the key-value (KV) cache, whose size grows linearly with sequence length, becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only the subset of tokens most relevant to attention. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from key-space directions associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Using this anchor, a soft-penalty token selection rule trades a small amount of utility for substantially improved safety alignment, and it reduces to the original compressor when the penalty is zero.
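The abstract does not give the exact form of the anchor or of the penalty, so the following is only a minimal NumPy sketch under stated assumptions, not AnchorKV's implementation. It assumes (i) the anchor for one layer and head is the unit-normalized difference between the mean key vectors collected offline from harmful and benign calibration prompts, and (ii) the soft penalty subtracts $\lambda$ times each key's positive cosine alignment with that anchor from the base compressor's retention score before top-$B$ selection. The function names \texttt{safety\_anchor} and \texttt{select\_tokens}, the calibration inputs, and the clipping of negative alignment to zero are all hypothetical choices made for illustration.

```python
import numpy as np


def safety_anchor(harmful_keys: np.ndarray, benign_keys: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction in one layer's key space.

    harmful_keys, benign_keys: (num_tokens, head_dim) key vectors gathered
    offline from harmful and benign prompts (hypothetical calibration data).
    """
    direction = harmful_keys.mean(axis=0) - benign_keys.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("harmful and benign key means coincide")
    return direction / norm


def select_tokens(keys: np.ndarray, base_scores: np.ndarray, anchor: np.ndarray,
                  budget: int, penalty: float = 0.0) -> np.ndarray:
    """Return sorted indices of the `budget` tokens kept in the KV cache.

    keys: (seq_len, head_dim); base_scores: (seq_len,) retention scores from
    the underlying compressor (e.g., observation-window attention mass).
    With penalty == 0 this is plain top-`budget` selection on base_scores.
    """
    # Cosine alignment of each cached key with the (hypothetical) safety anchor.
    key_norms = np.linalg.norm(keys, axis=1)
    cos = (keys @ anchor) / np.maximum(key_norms, 1e-12)
    # Assumed soft penalty: only alignment toward the harmful direction is penalized.
    adjusted = base_scores - penalty * np.clip(cos, 0.0, None)
    # Keep the highest adjusted scores, then restore original token order.
    keep = np.argsort(-adjusted, kind="stable")[:budget]
    return np.sort(keep)


if __name__ == "__main__":
    anchor = safety_anchor(np.array([[2.0, 0.0], [0.0, 0.0]]), np.array([[0.0, 0.0]]))
    keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    scores = np.array([0.9, 0.5, 0.6, 0.1])
    for lam in (0.0, 0.5, 1.0):
        print(lam, select_tokens(keys, scores, anchor, budget=2, penalty=lam).tolist())
```

In this toy run, $\lambda = 0$ keeps the two highest base scores (tokens 0 and 2), matching the abstract's statement that the rule reduces to the original compressor at zero penalty. Increasing $\lambda$ progressively displaces tokens whose keys point along the anchor, even when their base scores are high, which is one way to read the utility-for-safety trade-off the abstract describes.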