The rapid growth of large language models (LLMs) has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often incurs higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware such as the NVIDIA GH200 Grace Hopper Superchip, Pie swaps data concurrently without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts the amount of CPU memory used for swapping based on real-time information, optimizing memory usage and performance under varying conditions. Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie converges to an optimal swapping policy during cache warmup and effectively balances increased memory capacity with negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and reduces latency by up to 2X. Pie can also reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves orders-of-magnitude lower latency and 9.4X higher throughput.