Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP and Hybrid-EP. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, and supports two modes: Low-Latency (LL) for inference decoding and High-Throughput (HT) for training and inference prefill. LL mode targets small batches (1-128 tokens) and uses direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication to overlap the dispatch and combine phases. HT mode targets large batches (4096+ tokens) and uses hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes use the Device API for intra- and inter-node communication, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
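For readers unfamiliar with the two operations named above, the following is a minimal single-process sketch of what dispatch and combine compute in an MoE layer. It is not the NCCL EP API: the abstract does not give the signatures of ncclEpDispatch or ncclEpCombine, so the function names, the scalar "tokens", and the data layout here are all hypothetical and chosen only for illustration. In a real deployment each expert lives on a different GPU, and the grouping and gathering steps are what the communication library carries out over NVLink and RDMA.

```python
from collections import defaultdict


def dispatch(tokens, topk_experts):
    """Group each token by the experts it was routed to (hypothetical helper).

    Returns {expert_id: [(token_index, slot, token_value), ...]}.
    """
    per_expert = defaultdict(list)
    for t, experts in enumerate(topk_experts):
        for slot, e in enumerate(experts):
            per_expert[e].append((t, slot, tokens[t]))
    return per_expert


def run_experts(per_expert, expert_fns):
    """Apply each expert's function to the tokens it received."""
    return {
        e: [(t, slot, expert_fns[e](x)) for t, slot, x in entries]
        for e, entries in per_expert.items()
    }


def combine(expert_outputs, topk_weights, num_tokens):
    """Weighted sum of expert outputs, restored to the original token order."""
    out = [0.0] * num_tokens
    for entries in expert_outputs.values():
        for t, slot, y in entries:
            out[t] += topk_weights[t][slot] * y
    return out


def moe_reference(tokens, topk_experts, topk_weights, expert_fns):
    """Dispatch, expert compute, then combine, all in one process."""
    per_expert = dispatch(tokens, topk_experts)
    outputs = run_experts(per_expert, expert_fns)
    return combine(outputs, topk_weights, len(tokens))
```

Scalars stand in for hidden-state vectors to keep the sketch short; the routing indices and weights would normally come from the model's gating network.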