LLM inference for popular enterprise use cases, such as summarization, RAG, and code generation, typically observes prompt lengths that are orders of magnitude longer than generation lengths. This characteristic leads to high prefill cost and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving the high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers' KV cache using a much earlier layer's output, allowing prompt tokens to skip much of the model computation; ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch sizes for higher throughput; and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirements. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimal quality degradation across a wide range of tasks. In end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It achieves a normalized inference throughput of 560 TFlops/GPU, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs. Our training, inference, and model implementations are open-sourced and can be found at https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.
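The SingleInputKV idea above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the layer count, dimensions, and the `tanh` stand-in for a full attention+MLP transformer block are all hypothetical, chosen only to show how prompt tokens can skip the second half of the layers during prefill while still populating every layer's KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical sizes; SwiftKV skips ~50% of prefill compute, so skip from the midpoint
d, n_layers = 16, 8
skip_from = n_layers // 2

# toy per-layer weights (stand-ins for real transformer block and K/V projections)
W_layer = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
W_k = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
W_v = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def prefill_singleinputkv(x):
    """Prefill prompt tokens x of shape (seq, d): run only the first
    `skip_from` layers, then fill the KV cache of ALL remaining layers
    from the single hidden state produced at layer `skip_from`."""
    kv_cache = {}
    h = x
    for i in range(skip_from):
        kv_cache[i] = (h @ W_k[i], h @ W_v[i])
        h = np.tanh(h @ W_layer[i])  # stand-in for an attention+MLP block
    # later layers' K/V are all projected from one shared early hidden state,
    # so prompt tokens never execute layers skip_from .. n_layers-1
    for i in range(skip_from, n_layers):
        kv_cache[i] = (h @ W_k[i], h @ W_v[i])
    return kv_cache

kv = prefill_singleinputkv(rng.standard_normal((10, d)))
print(len(kv))  # KV entries exist for all 8 layers despite running only 4
```

Generation then proceeds through all layers as usual, attending to the prefilled cache; only prompt-token processing is shortened, which matches the abstract's claim of roughly halved prefill compute.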