The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. However, the growing context lengths and batch sizes in offline LLM inference escalate the memory requirements of the key-value (KV) cache, placing a heavy burden on GPU VRAM, especially in resource-constrained scenarios (e.g., edge computing and personal devices). Several cost-effective solutions leverage host memory or SSDs to reduce storage costs and improve throughput in offline inference scenarios. Nevertheless, they suffer significant performance penalties from intensive KV cache accesses due to limited PCIe bandwidth. To address these issues, we propose InstInfer, a novel LLM inference system that offloads the most performance-critical computation (i.e., attention in the decoding phase) and data (i.e., the KV cache) to Computational Storage Drives (CSDs), which minimizes the enormous KV transfer overheads. InstInfer designs a dedicated flash-aware in-storage attention engine with KV cache management mechanisms that exploits the high internal bandwidth of CSDs instead of being limited by PCIe bandwidth. Optimized P2P transmission between the GPU and CSDs further reduces data migration overheads. Experimental results demonstrate that for a 13B model on an NVIDIA A6000 GPU, InstInfer improves long-sequence inference throughput by up to 11.1$\times$ compared to existing SSD-based solutions such as FlexGen.
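To make the VRAM-pressure claim concrete, a minimal back-of-the-envelope sketch of KV cache size follows. The model configuration (40 layers, hidden size 5120, fp16) is an assumption typical of 13B-class models, not a figure taken from the paper:

```python
def kv_cache_bytes(n_layers: int, hidden: int, seq_len: int,
                   batch: int, dtype_bytes: int = 2) -> int:
    """Estimate total KV cache size for standard multi-head attention.

    Each layer stores one key and one value vector of size `hidden`
    per token per sequence, hence the leading factor of 2.
    """
    return 2 * n_layers * hidden * seq_len * batch * dtype_bytes

# Hypothetical 13B-class config: 40 layers, hidden 5120, fp16 (2 bytes).
total = kv_cache_bytes(n_layers=40, hidden=5120, seq_len=4096, batch=32)
print(f"{total / 2**30:.0f} GiB")  # ~100 GiB -- far beyond a 48 GB A6000
```

At 4K context and batch size 32 the cache alone reaches roughly 100 GiB, which motivates spilling it to host memory, SSDs, or, as InstInfer proposes, keeping it inside CSDs.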