Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) caching stores intermediate activations so that the GPU performs only the incremental computation required for each new token, significantly lowering the computational overhead of token generation. However, the memory footprint of the KV cache grows rapidly and often exceeds GPU memory capacity. A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe link between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or by employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. In this paper, we introduce an efficient CPU-GPU I/O-aware LLM inference method that avoids transferring the entire KV cache from CPU to GPU: it recomputes part of the KV cache from activations while concurrently transferring the remainder over the PCIe bus, overlapping GPU recomputation with data transfer to minimize GPU idle time and maximize inference performance. Our method is fully automated through a profiler module that uses input characteristics and system hardware information, a scheduler module that optimizes the distribution of computation and communication workloads, and a runtime module that efficiently executes the derived execution plan. Experimental results show that our method achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
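The core trade-off described above, recomputing part of the KV cache on the GPU while streaming the rest over PCIe, can be sketched as a simple cost model: choose the recompute fraction so that both paths finish at the same time. Everything below (the function name, per-token costs, and bandwidth figure) is an illustrative assumption, not the paper's actual profiler or scheduler, which derives its plan from measured input and hardware characteristics:

```python
def balanced_split(n_tokens: int,
                   recompute_us_per_token: float,
                   bytes_per_token: int,
                   pcie_gb_per_s: float):
    """Pick the fraction r of cached tokens to recompute on the GPU so that
    recomputation time equals the PCIe transfer time of the remaining (1 - r).
    All costs are hypothetical, linear per-token estimates."""
    # Time to move one token's KV entries over PCIe, in microseconds.
    xfer_us_per_token = bytes_per_token / (pcie_gb_per_s * 1e9) * 1e6
    # Balance r * t_recompute == (1 - r) * t_transfer.
    r = xfer_us_per_token / (recompute_us_per_token + xfer_us_per_token)
    overlapped_us = r * n_tokens * recompute_us_per_token  # both streams finish together
    full_transfer_us = n_tokens * xfer_us_per_token        # baseline: ship the whole cache
    return r, overlapped_us, full_transfer_us
```

With, say, 4096 cached tokens, 10 µs of recomputation per token, 256 KiB of KV data per token, and a 16 GB/s PCIe link, this model recomputes roughly 62% of the tokens and the overlapped latency falls well below transferring the entire cache; a real scheduler would profile these costs per layer and per batch rather than assume they are linear.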