The expanding context windows of large language models (LLMs) have greatly enhanced their capabilities across a wide range of applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which arise when the growing demand for GPU Key-Value (KV) cache allocation clashes with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that reduces TTFT without requiring additional hardware or compromising output performance, while seamlessly integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory, coupled with an SLO-aware scheduler to optimize overall Service Level Objectives (SLOs). Comprehensive evaluations on representative models ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 11x and reduces SLO violation rates by 28.7\%, significantly enhancing the user experience.
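To make the layer-wise idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation; the class and method names are invented for illustration). Rather than reserving KV cache blocks for every layer of a request up front, blocks are claimed one layer at a time during prefill, and blocks held by already-computed layers can be offloaded to a host-memory pool when free GPU blocks run out:

```python
# Illustrative sketch of layer-wise KV block management (hypothetical API,
# not the paper's code). Blocks are allocated per layer; when GPU blocks
# run short, KV blocks of earlier, already-computed layers are offloaded
# to host memory so later layers can reuse the freed GPU blocks.

class LayerKVManager:
    def __init__(self, gpu_blocks: int):
        self.free_gpu = gpu_blocks   # free GPU KV cache blocks
        self.on_gpu = {}             # layer -> block count held on GPU
        self.offloaded = {}          # layer -> block count moved to host memory

    def alloc_layer(self, layer: int, blocks: int) -> None:
        """Allocate KV blocks for one layer, offloading earlier layers if needed."""
        while self.free_gpu < blocks and self.on_gpu:
            victim = min(self.on_gpu)        # offload the earliest finished layer
            freed = self.on_gpu.pop(victim)
            self.offloaded[victim] = self.offloaded.get(victim, 0) + freed
            self.free_gpu += freed
        if self.free_gpu < blocks:
            raise MemoryError("not enough KV blocks even after offloading")
        self.free_gpu -= blocks
        self.on_gpu[layer] = self.on_gpu.get(layer, 0) + blocks


# A 4-layer model needing 3 blocks per layer, with only 8 GPU blocks total:
mgr = LayerKVManager(gpu_blocks=8)
for layer in range(4):
    mgr.alloc_layer(layer, 3)

print(sorted(mgr.offloaded))  # earlier layers were moved off-GPU: [0, 1]
print(mgr.free_gpu)           # 2 GPU blocks remain free
```

Under full up-front allocation, this request would need 12 GPU blocks and would queue behind others; the layer-wise scheme serves it with 8 by spilling completed layers to host memory, which is the mechanism the abstract credits for cutting queuing delay.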