Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.