The growing demand for efficient communication in distributed training and inference has sparked significant interest in reconfigurable photonic interconnects across both academia and industry. Mixture-of-Experts (MoE) models, with their highly skewed communication patterns, present a natural opportunity for such circuit-switched fabrics. However, existing approaches largely optimize communication in isolation, overlooking the interaction between communication and the expert computation that follows. In this paper, we revisit circuit scheduling for all-to-all communication in MoE execution. We show that the dispatch--compute--combine structure fundamentally challenges classical scheduling techniques such as Birkhoff--von Neumann (BvN) decomposition. First, MoE communication matrices are rarely doubly stochastic, introducing significant scheduling bubbles in BvN-based schedules. Second, while decomposition enables communication--compute overlap, the excessive number of matchings produced by BvN fragments execution into small batches, leading to severe compute inefficiencies due to fixed execution overheads. Motivated by these observations, we explore a simple greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching. Despite its simplicity, the approach significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all.
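To make the greedy max-weight idea concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the demand matrix holds token counts from source GPU to destination GPU, that each matching drains the full remaining demand of the pairs it selects, and that the cap on matchings is a plain integer. The names `max_weight_matching`, `greedy_decompose`, and `max_matchings` are hypothetical, and the brute-force search over permutations is only practical for very small matrices; a real scheduler would use an assignment solver.

```python
from itertools import permutations


def max_weight_matching(demand):
    """Return perm with perm[src] = dst that maximizes total matched demand.

    Brute force over all permutations: illustrative only, feasible for small n.
    """
    n = len(demand)
    return max(
        permutations(range(n)),
        key=lambda p: sum(demand[i][p[i]] for i in range(n)),
    )


def greedy_decompose(demand, max_matchings):
    """Greedily peel off max-weight matchings, at most max_matchings of them.

    Assumption: a matching carries the full remaining demand of every pair it
    selects, so each matching keeps a large batch instead of a thin slice.
    Returns (schedule, remaining); remaining holds any traffic left unserved
    once the cap is reached.
    """
    remaining = [row[:] for row in demand]  # do not mutate the caller's matrix
    schedule = []
    while len(schedule) < max_matchings and any(any(row) for row in remaining):
        perm = max_weight_matching(remaining)
        circuits = {}
        for src, dst in enumerate(perm):
            if remaining[src][dst] > 0:
                circuits[src] = (dst, remaining[src][dst])
                remaining[src][dst] = 0
        schedule.append(circuits)
    return schedule, remaining


# Hypothetical 3-GPU dispatch matrix: demand[src][dst] = tokens to send.
demand = [
    [0, 8, 1],
    [5, 0, 2],
    [7, 1, 0],
]

schedule, remaining = greedy_decompose(demand, max_matchings=4)
for step, circuits in enumerate(schedule):
    print(step, circuits)
```

In this toy matrix the row and column sums are unequal, so it is not a scaled doubly stochastic matrix. Full-drain max-weight matchings still cover all traffic in two circuit configurations here, and each configuration's expert batch stays as large as the demand allows. How leftover traffic is handled when the cap is reached is left open in this sketch.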