As large language models (LLMs) continue to scale up, mixture-of-experts (MoE) has become a common technique in state-of-the-art (SOTA) models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as a platform that integrates numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, making them a promising platform for hosting MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disks places high-overhead expert migration on the critical path. To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of the attention and MoE layers to balance communication pressure and improve performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide the migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC platform delivers an average of 39% higher per-device MoE performance owing to its scalability to larger EP degrees.
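The following is a minimal sketch, in Python, of the NI-Balancer scheduling idea as stated above: a full expert migration is split into multiple steps, and successive steps alternate between the attention layer's cold links and the MoE layer's cold links, which are complementary under ER-Mapping. All identifiers here (`Chunk`, `split_migration`, `schedule_migration`) are hypothetical illustrations; the actual chunk sizing and link-selection policy are not specified in this abstract.

```python
# Hypothetical sketch of the NI-Balancer idea: split one expert migration
# into chunks and place each chunk on the link set that is cold during the
# layer phase it is scheduled in. Names and policies are assumptions, not
# the paper's actual implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    expert_id: int
    num_bytes: int  # size of this migration step


def split_migration(expert_id: int, total_bytes: int, n_steps: int) -> List[Chunk]:
    """Split a complete expert migration into n_steps equal-sized chunks."""
    step = total_bytes // n_steps
    return [Chunk(expert_id, step) for _ in range(n_steps)]


def schedule_migration(chunks: List[Chunk]) -> List[Tuple[str, Chunk]]:
    """Alternate chunks over the attention-cold and MoE-cold link sets.

    Under ER-Mapping the two layers' cold-link distributions are
    complementary, so each chunk travels over links that are idle during
    the phase it is assigned to, keeping the migration off the critical path.
    """
    phases = ("attention_cold_links", "moe_cold_links")
    return [(phases[i % 2], chunk) for i, chunk in enumerate(chunks)]


if __name__ == "__main__":
    # Example: migrate a 64 MiB expert in 4 non-invasive steps.
    chunks = split_migration(expert_id=7, total_bytes=64 << 20, n_steps=4)
    for phase, chunk in schedule_migration(chunks):
        print(f"step on {phase}: expert {chunk.expert_id}, {chunk.num_bytes} bytes")
```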