All-to-All(v) communication is a critical primitive in modern machine learning workloads, particularly mixture-of-experts (MoE) models. Unfortunately, efficient scheduling is challenging due to workload skew, heterogeneous two-tier fabrics, and incast congestion, compounded by the dynamic nature of MoE workloads, where traffic patterns shift every few hundred milliseconds. Existing schedulers scale poorly, incurring synthesis times ranging from seconds to hours, which makes them impractical for such rapidly changing workloads. We present FAST, an efficient All-to-All(v) scheduler. FAST addresses skew through intra-server rebalancing and enforces balanced, one-to-one scale-out transfers that avoid incast. Evaluated extensively on both NVIDIA H200 and AMD MI300X clusters, FAST consistently outperforms state-of-the-art solutions on skewed workloads while reducing synthesis time by orders of magnitude.