Collective communication (CC) is critical for scaling distributed machine learning (DML). The predictable traffic patterns of DML present a strong opportunity for applying optical network technologies. Optical networks with reconfigurable topologies promise high bandwidth and low latency for collective communication. However, existing approaches face inherent limitations: static topologies are inefficient for the dynamic communication patterns within a CC algorithm, while reconfiguring the topology to match every step of the algorithm incurs significant overhead. In this paper, we propose SWOT, a demand-aware optical network framework that employs ``intra-collective reconfiguration'' to dynamically align network resources with CC traffic patterns. SWOT hides reconfiguration latency by overlapping it with data transmission through three key techniques: \textit{Heterogeneous Message Splitting}, \textit{Asynchronous Overlapping}, and \textit{Topology Bypassing}. Extensive simulations show that SWOT reduces communication completion time by up to 89.7\% across diverse CC algorithms compared to static baselines, and that it remains robust to varying optical resources and reconfiguration delays.