The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. Existing deployments, however, face a structural tension. Data parallelism (DP) scales throughput well but replicates the model weights on every GPU, leaving limited memory for the key-value (KV) cache and thereby constraining batch size. Model parallelism reduces the per-device weight footprint but requires fine-grained synchronization that erodes DP's independence and scheduling flexibility. We present SiDP, a memory-efficient data-parallel paradigm for offline LLM inference that treats weights as a bandwidth-backed shared resource inside a DP group. Instead of storing the full model on every GPU, SiDP organizes weights as a distributed pool in which each layer is owned by a single GPU. Other replicas access a layer's weights on demand through two complementary execution modes: a Weight-as-a-Service (WaS) mode that streams remote weights over NVLink into a small cache in the large-batch regime, and a Compute-as-a-Service (CaS) mode that ships activations to the owning GPU in the small-batch tail. Evaluated on NVIDIA H20, H200, and B200 GPUs with Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B, SiDP increases usable KV capacity by up to 1.8x under the same configurations and converts this into up to 1.5x higher end-to-end throughput over baselines (vLLM) for offline workloads.
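The memory trade-off and the two execution modes described in the abstract can be illustrated with a rough back-of-the-envelope sketch. Everything here is an assumption made for illustration: the `GroupConfig` fields, the equal-sized layers, the even split of layers across GPUs, and above all the `pick_mode` rule, which simply compares the bytes each mode would move over the interconnect. The abstract does not state SiDP's actual switching policy, which likely also accounts for overlap between weight streaming and compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupConfig:
    """Hypothetical description of one DP group; all fields are illustrative."""
    num_gpus: int            # replicas sharing one weight pool
    gpu_mem_gb: float        # per-GPU memory available to weights + KV cache
    weights_gb: float        # full model weight footprint
    num_layers: int          # layers, assumed equal in size
    stream_cache_gb: float   # small buffer holding streamed remote weights
    hidden_size: int         # activation width per token
    bytes_per_elem: int = 2  # e.g. 16-bit activations


def kv_budget_replicated(cfg: GroupConfig) -> float:
    """KV-cache budget per GPU when every replica stores the full model."""
    return cfg.gpu_mem_gb - cfg.weights_gb


def kv_budget_pooled(cfg: GroupConfig) -> float:
    """KV-cache budget per GPU when layers are split evenly across owners."""
    owned_gb = cfg.weights_gb / cfg.num_gpus
    return cfg.gpu_mem_gb - owned_gb - cfg.stream_cache_gb


def pick_mode(cfg: GroupConfig, batch_tokens: int) -> str:
    """Assumed heuristic: choose the mode that moves fewer bytes per remote layer.

    WaS pulls the layer's weights to the requester; CaS sends the batch's
    activations to the owner and receives the layer output back.
    """
    layer_bytes = cfg.weights_gb * 1e9 / cfg.num_layers
    activation_bytes = 2 * batch_tokens * cfg.hidden_size * cfg.bytes_per_elem
    return "WaS" if layer_bytes <= activation_bytes else "CaS"
```

With made-up numbers (4 GPUs, 96 GB each, 64 GB of weights in 64 layers, a 2 GB streaming cache), the pooled layout leaves 78 GB per GPU for KV cache against 32 GB under full replication. These figures are not taken from the paper's evaluation and ignore activation and workspace memory. Under the same assumptions, `pick_mode` returns `"CaS"` for a small batch and `"WaS"` for a large one, matching the qualitative split the abstract describes.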