Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) whose GPU and CPU are tightly coupled via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput relative to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlock the full potential of Superchips for responsive LLM serving.