Offline LLM inference aims to maximize request throughput under a fixed budget, which makes commodity GPU servers a promising choice. However, prior work typically considers offloading and parallelism in isolation, resulting in suboptimal performance. In this paper, we propose PipeMax, a high-throughput LLM inference system that integrates pipeline parallelism with offloading to overcome the interconnect and memory constraints of GPU servers. In particular, pipeline parallelism naturally incurs low communication overhead and keeps only one batch active on each GPU at a time, which allows the KV cache of inactive batches to be offloaded. By coordinating computation with offloading data movement, PipeMax effectively expands GPU memory capacity and sustains large-batch execution. Experiments on an 8-GPU node show that PipeMax achieves up to 2.51x higher throughput than vLLM, and up to 1.42x and 1.38x higher throughput than two state-of-the-art high-throughput LLM systems, respectively.
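The memory argument can be made concrete with a toy model. The sketch below is not PipeMax's scheduler; it assumes a simple round-robin pipeline schedule in which stage s works on batch (t - s) mod B at step t, and it assumes that a stage keeps on the GPU only the KV cache of its active batch plus a small number of batches being prefetched. The function names, the `prefetch_depth` parameter, and the example sizes are all hypothetical.

```python
def active_batch(stage: int, step: int, num_batches: int) -> int:
    """Batch that `stage` processes at `step` under a round-robin schedule (assumed)."""
    return (step - stage) % num_batches


def no_offload_kv(num_batches: int, kv_gib_per_batch: float) -> float:
    """Without offloading, every in-flight batch keeps its KV cache on the GPU."""
    return num_batches * kv_gib_per_batch


def peak_resident_kv(num_stages: int, num_batches: int,
                     kv_gib_per_batch: float, prefetch_depth: int = 1) -> float:
    """Peak per-stage KV cache (GiB) when inactive batches are offloaded.

    Assumption: a stage holds the active batch plus the next `prefetch_depth`
    batches it will process; everything else lives off the GPU.
    """
    if num_batches < num_stages:
        raise ValueError("need at least one batch per stage to fill the pipeline")
    peak = 0
    for step in range(2 * num_batches):  # two full rounds cover the steady state
        seen = set()
        for stage in range(num_stages):
            current = active_batch(stage, step, num_batches)
            # Each batch is active on at most one stage at any step.
            assert current not in seen
            seen.add(current)
            resident = {(current + k) % num_batches
                        for k in range(prefetch_depth + 1)}
            peak = max(peak, len(resident))
    return peak * kv_gib_per_batch


if __name__ == "__main__":
    # Hypothetical sizes: 8 stages, 8 in-flight batches, 4 GiB of KV cache each.
    print("no offload  :", no_offload_kv(8, 4.0), "GiB per stage")
    print("with offload:", peak_resident_kv(8, 8, 4.0), "GiB per stage")
```

Under these assumed numbers the per-stage KV footprint drops from 32.0 GiB to 8.0 GiB, which is the headroom that can be spent on larger batches. The model ignores transfer time; in practice the benefit depends on whether the offloading traffic can be overlapped with computation, which is the coordination the abstract refers to.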