The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability in multi-GPU systems is severely limited by inter-GPU communication bottlenecks. Although overlapping communication with computation is a widely recognized optimization, deploying it effectively remains challenging in terms of both performance and programmability. In this work, we identify the root cause as a fundamental abstraction mismatch between MoE's dynamic, irregular token-to-expert mapping and the static, address-centric communication model of modern GPUs. This mismatch necessitates a complex software mediation phase that resolves addresses before any data transfer, limiting both performance and software flexibility. To resolve this, we propose MoE-Hub, a hardware-software co-design that introduces a destination-agnostic communication paradigm. MoE-Hub decouples data transmission from address management: producers send data immediately after routing using only a logical destination, while address allocation and data-flow orchestration are handled transparently by lightweight hardware in the GPU hub. By hardware-accelerating the entire communication control plane, MoE-Hub enables seamless and transparent overlap of communication with computation. Our evaluation shows that MoE-Hub achieves 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over state-of-the-art systems.
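The contrast between address-centric and destination-agnostic dispatch can be illustrated with a small software toy. This is only a sketch under stated assumptions, not the MoE-Hub design: the abstract describes lightweight hardware in the GPU hub, whereas here a plain Python class stands in for it, and every name (`address_centric_dispatch`, `ToyHub`, `send`) is hypothetical. The first function shows why a mediation phase is needed when the sender must name an exact slot: it has to count tokens per expert and prefix-sum the counts before the first token can move. The second lets the sender name only a logical expert and leaves slot allocation to the receiving side.

```python
from collections import defaultdict
from itertools import accumulate


def address_centric_dispatch(tokens, expert_ids, num_experts):
    """Sender-side address resolution: every slot is known before any transfer."""
    # Mediation phase: count tokens per expert ...
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    # ... then take an exclusive prefix sum to get each expert's base offset.
    base = [0] + list(accumulate(counts))[:-1]

    # Only now can data move: each token is written to a resolved address.
    buffer = [None] * len(tokens)
    cursor = list(base)
    for tok, e in zip(tokens, expert_ids):
        buffer[cursor[e]] = tok
        cursor[e] += 1
    return buffer, base, counts


class ToyHub:
    """Hypothetical software stand-in for receiver-side slot allocation."""

    def __init__(self):
        self.slots = defaultdict(list)

    def send(self, token, logical_expert):
        # The sender supplies only a logical destination; the hub picks the slot.
        self.slots[logical_expert].append(token)
        return len(self.slots[logical_expert]) - 1
```

In the first path no token can be sent until the full routing result has been scanned, which is the serialization that makes overlap hard to program. In the second path each `send` can be issued as soon as that token's expert is known, and the per-expert grouping ends up the same.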