Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, which makes end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading mixture-of-experts (MoE) weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations: split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range, with an 8K input sequence length and a 1K output sequence length.
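The idea of fetching missing experts on demand while prefetching asynchronously can be illustrated with a toy, CPU-only sketch. It is not the TensorRT-LLM implementation: the round-robin expert placement, the scalar "weights", the names `owner_of`, `fetch_remote`, and `run_rank`, and the choice to fetch every non-local expert of a layer are all assumptions made for illustration, and a background thread stands in for an asynchronous peer-to-peer copy.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

# All sizes below are hypothetical and chosen only to keep the example small.
NUM_RANKS = 4     # data-parallel ranks (one per GPU)
NUM_EXPERTS = 8   # experts per MoE layer
NUM_LAYERS = 3    # MoE layers


def owner_of(expert_id: int) -> int:
    """Assumed placement: experts are spread round-robin across ranks."""
    return expert_id % NUM_RANKS


# Toy weight store: peer_store[rank][(layer, expert)] holds one scalar weight.
peer_store: dict[int, dict[tuple[int, int], float]] = {
    r: {} for r in range(NUM_RANKS)
}
for layer in range(NUM_LAYERS):
    for expert in range(NUM_EXPERTS):
        peer_store[owner_of(expert)][(layer, expert)] = float(layer * 10 + expert)


def missing_experts(rank: int) -> list[int]:
    """Experts whose weights are not resident on this rank."""
    return [e for e in range(NUM_EXPERTS) if owner_of(e) != rank]


def fetch_remote(layer: int, expert_ids: list[int]) -> dict[int, float]:
    """Stand-in for a one-sided copy from peers; no collective is involved."""
    return {e: peer_store[owner_of(e)][(layer, e)] for e in expert_ids}


def run_rank(rank: int, x: float) -> tuple[float, int]:
    """Run every layer on one rank, overlapping the next fetch with compute."""
    fetched = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first layer's remote experts before any compute.
        pending = pool.submit(fetch_remote, 0, missing_experts(rank))
        for layer in range(NUM_LAYERS):
            remote = pending.result()  # wait only on this rank's own fetch
            fetched += len(remote)
            if layer + 1 < NUM_LAYERS:
                # Prefetch the next layer while the current layer computes.
                pending = pool.submit(
                    fetch_remote, layer + 1, missing_experts(rank)
                )
            local = {
                e: w for (l, e), w in peer_store[rank].items() if l == layer
            }
            weights = {**local, **remote}
            x = x + sum(weights.values())  # toy stand-in for the MoE compute
    return x, fetched


if __name__ == "__main__":
    for r in range(NUM_RANKS):
        print(r, run_rank(r, 0.0))
```

Each call to `run_rank` waits only on its own outstanding fetch and never on another rank's progress, which is the property the abstract attributes to removing collective synchronization. A real system would presumably bound the memory used for fetched weights and fetch only the experts a batch actually routes to; neither is modeled here.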