Collective communication (CC) is critical for scaling distributed machine learning (DML). The predictable traffic patterns of DML present a great opportunity for applying optical network technologies. Optical networks with reconfigurable topologies promise high bandwidth and low latency for collective communications. However, existing approaches face inherent limitations: static topologies are inefficient for the dynamic communication patterns within a CC algorithm, while frequent topology reconfiguration to match every step of the algorithm incurs significant overhead. In this paper, we propose SWOT, a demand-aware optical network framework that employs ``intra-collective reconfiguration'' to dynamically align network resources with CC traffic patterns. SWOT hides reconfiguration latency by overlapping it with data transmission through three key techniques: Heterogeneous Message Splitting, Asynchronous Overlapping, and Topology Bypassing. Extensive simulations demonstrate that SWOT reduces communication completion time by up to 89.7% across diverse CC algorithms compared to static baselines, and remains robust to varying optical resources and reconfiguration delays.