Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
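The retention mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not TRIM-KV's actual formulation: the gate is modeled as a fixed linear map from the key vector to a sigmoid score, the decay is a simple exponential, and the class name and parameters (`RetentionGatedCache`, `budget`, `decay`) are hypothetical.

```python
import math
import random

class RetentionGatedCache:
    """Toy sketch of learned-score KV eviction with time decay.
    The linear gate, sigmoid scoring, and exponential decay are
    illustrative assumptions, not the paper's exact design."""

    def __init__(self, head_dim, budget, decay=0.99, seed=0):
        rng = random.Random(seed)
        # The "retention gate" here is just a linear map key -> scalar;
        # in practice it would be a small learned module per layer/head.
        self.w = [rng.gauss(0.0, 1.0 / math.sqrt(head_dim))
                  for _ in range(head_dim)]
        self.budget, self.decay = budget, decay
        self.keys, self.values, self.scores = [], [], []

    def append(self, key, value):
        # Score the token once, at creation time, from its key vector.
        logit = sum(wi * ki for wi, ki in zip(self.w, key))
        score = 1.0 / (1.0 + math.exp(-logit))
        # Older tokens' scores decay, modeling waning long-term utility.
        self.scores = [s * self.decay for s in self.scores]
        self.keys.append(key)
        self.values.append(value)
        self.scores.append(score)
        # Evict the lowest-scoring token once the budget is exceeded,
        # so the cache keeps only the highest-retention entries.
        if len(self.keys) > self.budget:
            i = min(range(len(self.scores)), key=self.scores.__getitem__)
            for seq in (self.keys, self.values, self.scores):
                seq.pop(i)
```

For example, appending five tokens to a cache with `budget=3` leaves exactly three entries, with the two lowest-scoring tokens evicted along the way. In the full method these scores would be trained via distillation and a capacity loss rather than fixed at random.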