Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel. On a single node they outperform collective-based MoE by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric: communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and the cost of that fence grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb the inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to $10.3\times$ end-to-end speedup. Perseus on IBRC is at least as fast as IBGDA GPU-direct and up to $1.2\times$ faster, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
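As a rough illustration of why per-destination batching reduces fence count, the following toy model compares two operation schedules for the same set of tile transfers. It is a sketch under stated assumptions, not the Perseus implementation: the function names, the `(destination, tile)` representation, and the choice of 4 destinations with 8 tiles each (picked so the ratio comes out to $8\times$) are all hypothetical, and the model counts fences only; it does not model NIC timing.

```python
from collections import defaultdict


def per_tile_schedule(transfers):
    """Baseline (assumed): each tile put is followed by its own fence, then its signal."""
    ops = []
    for dest, tile in transfers:
        ops.append(("put", dest, tile))
        ops.append(("fence", dest))
        ops.append(("signal", dest, tile))
    return ops


def per_destination_schedule(transfers):
    """Batched (assumed): all puts to one destination, a single fence, then all signals."""
    by_dest = defaultdict(list)
    for dest, tile in transfers:
        by_dest[dest].append(tile)
    ops = []
    for dest, tiles in by_dest.items():
        ops.extend(("put", dest, t) for t in tiles)
        ops.append(("fence", dest))
        ops.extend(("signal", dest, t) for t in tiles)
    return ops


def count_fences(ops):
    """Number of fence operations in a schedule."""
    return sum(1 for op in ops if op[0] == "fence")


def ordering_holds(ops):
    """Check that every signal is preceded by its put and by a later fence to the same destination."""
    put_at = {}      # (dest, tile) -> index of the put
    last_fence = {}  # dest -> index of the most recent fence seen so far
    for i, op in enumerate(ops):
        if op[0] == "put":
            put_at[(op[1], op[2])] = i
        elif op[0] == "fence":
            last_fence[op[1]] = i
        else:  # signal
            key = (op[1], op[2])
            if key not in put_at or last_fence.get(op[1], -1) < put_at[key]:
                return False
    return True


# Hypothetical workload: 4 destinations, 8 tiles per destination.
transfers = [(dest, tile) for dest in range(4) for tile in range(8)]
baseline = per_tile_schedule(transfers)
batched = per_destination_schedule(transfers)

print(count_fences(baseline), count_fences(batched))
print(ordering_holds(baseline), ordering_holds(batched))
```

In this model both schedules keep every signal behind a fence that follows its data, but the batched schedule issues one fence per destination instead of one per tile, so the reduction factor equals the number of tiles batched per destination.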