Efficient, VRAM-Constrained xLM Inference on Clients

From arXiv. Accepted at MLSys 2026 (Industry Track). 17 pages, 7 figures, 9 tables. Code and artifacts available at: https://github.com/deepshnv/pipeshard-mlsys26-ae

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique that achieves efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. By combining model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens-per-second (TPS) metrics while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt): vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and the VRAM demand of CR1 inference drops by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper has been accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifacts are available at: https://github.com/deepshnv/pipeshard-mlsys26-ae
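
To make two of the ingredients named in the abstract concrete (prioritized tensor placement in VRAM and pipelined copy-compute), the following is a minimal toy timing model. It is not the paper's scheduler: the `Shard` fields, the integer `priority` score, the two-staging-buffer assumption, and all numbers are hypothetical and chosen only for illustration. It shows how keeping high-priority shards resident in VRAM removes their copy cost, and how overlapping the copy of the next streamed shard with the compute of the current one shortens the total time compared with a serial copy-then-compute loop.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    size_mb: int       # VRAM footprint if the shard is kept resident
    copy_ms: float     # host-to-device copy time when the shard is streamed
    compute_ms: float  # GPU compute time for the shard
    priority: int      # hypothetical score; higher means "pin in VRAM first"


def pin_by_priority(shards, vram_budget_mb):
    """Greedily keep the highest-priority shards resident within the VRAM budget."""
    pinned, used = set(), 0
    for s in sorted(shards, key=lambda s: s.priority, reverse=True):
        if used + s.size_mb <= vram_budget_mb:
            pinned.add(s.name)
            used += s.size_mb
    return pinned


def serial_ms(shards, pinned):
    """Copy (if not resident), then compute, one shard at a time."""
    return sum((0.0 if s.name in pinned else s.copy_ms) + s.compute_ms
               for s in shards)


def pipelined_ms(shards, pinned, buffers=2):
    """Overlap the copy of upcoming streamed shards with GPU compute.

    Assumes one copy engine, one compute stream, and `buffers` staging
    buffers; a buffer is reusable once the streamed shard that occupied it
    has finished computing.
    """
    copy_free = 0.0     # time at which the copy engine is next idle
    gpu_free = 0.0      # time at which the GPU is next idle
    streamed_done = []  # compute-finish times of streamed shards
    for s in shards:
        ready = 0.0     # resident shards are ready immediately
        if s.name not in pinned:
            k = len(streamed_done) - buffers
            buffer_free = streamed_done[k] if k >= 0 else 0.0
            copy_free = max(copy_free, buffer_free) + s.copy_ms
            ready = copy_free
        gpu_free = max(gpu_free, ready) + s.compute_ms
        if s.name not in pinned:
            streamed_done.append(gpu_free)
    return gpu_free


# Four equal shards in execution order, earlier ones given higher priority.
shards = [Shard(f"s{i}", 100, 10.0, 5.0, 4 - i) for i in range(4)]
for budget_mb in (0, 200):
    pinned = pin_by_priority(shards, budget_mb)
    print(budget_mb, sorted(pinned),
          serial_ms(shards, pinned), pipelined_ms(shards, pinned))
```

With these made-up numbers, a zero VRAM budget gives 60.0 ms serial versus 45.0 ms pipelined, and a 200 MB budget (which pins `s0` and `s1`) gives 40.0 ms versus 25.0 ms. In this toy model the pipelined time is bounded below by the total copy time of the streamed shards, which is why placement and overlap are complementary.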
