All-to-All(v) communication is a critical primitive in modern machine learning workloads, particularly mixture-of-experts (MoE) models. Scheduling it efficiently is challenging due to workload skew, heterogeneous two-tier fabrics, and incast congestion, compounded by the dynamic nature of MoE workloads, where traffic patterns shift every few hundred milliseconds. Existing schedulers scale poorly, incurring seconds to hours of synthesis time, which renders them impractical in this regime. We present FAST, an efficient All-to-All(v) scheduler. FAST addresses skew through intra-server rebalancing and enforces balanced, one-to-one scale-out transfers that avoid incast. Evaluated extensively on both NVIDIA H200 and AMD MI300X clusters, FAST consistently outperforms state-of-the-art solutions on skewed workloads while reducing synthesis time by orders of magnitude.
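For intuition, the following is a minimal sketch of the two ideas the abstract names, not FAST's actual algorithm: skewed per-rank send volumes are first evened out across the ranks of each server (intra-server rebalancing over the fast scale-up fabric), and inter-server traffic is then issued in permutation rounds so each server has exactly one outbound and one inbound scale-out transfer at a time, which structurally rules out incast. All names and the toy send matrix here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not FAST): intra-server rebalancing plus
# one-to-one inter-server rounds for All-to-All(v).
import numpy as np

def intra_server_rebalance(send_bytes: np.ndarray, ranks_per_server: int) -> np.ndarray:
    """Even out each server's outbound volume across its local ranks.

    send_bytes[i, j] = bytes rank i must send to rank j. The assumption is
    that the scale-up fabric inside a server is fast enough for local ranks
    to first shuffle data so each carries an equal share of the server's
    traffic toward every destination (column sums are preserved).
    """
    n = send_bytes.shape[0]
    balanced = send_bytes.astype(float).copy()
    for s in range(0, n, ranks_per_server):
        block = balanced[s:s + ranks_per_server, :]
        # Replace each local rank's row with the server-wide average.
        block[:, :] = block.mean(axis=0, keepdims=True)
    return balanced

def one_to_one_rounds(num_servers: int) -> list[list[tuple[int, int]]]:
    """Permutation schedule: in round r, server i sends to (i + r) % S.

    Each round is a perfect matching over servers, so every server has
    exactly one inbound and one outbound scale-out transfer -- no incast.
    """
    return [[(i, (i + r) % num_servers) for i in range(num_servers)]
            for r in range(1, num_servers)]

# Example: 4 servers x 2 ranks, with one heavily skewed sender.
ranks_per_server, num_servers = 2, 4
n = ranks_per_server * num_servers
rng = np.random.default_rng(0)
send = rng.integers(1, 10, size=(n, n)).astype(float)
send[0, :] *= 50  # rank 0 hosts a hot MoE expert
balanced = intra_server_rebalance(send, ranks_per_server)
print(one_to_one_rounds(num_servers)[0])  # [(0, 1), (1, 2), (2, 3), (3, 0)]
```

In this toy schedule, a hot rank no longer bottlenecks its server's uplink (its load is spread over local peers first), and because each round is a permutation of servers, no destination ever receives from more than one sender at once.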