All-to-all collective communication is a core primitive in distributed machine learning and high-performance computing. At server scale, the communication demands of these workloads are increasingly outstripping the bandwidth and energy limits of electrical interconnects, driving growing interest in photonic interconnects. However, leveraging these interconnects for all-to-all communication is nontrivial. The core challenge lies in jointly optimizing a sequence of topologies and flow schedules, reconfiguring only when the transmission time saved by traversing shorter paths outweighs the reconfiguration cost. The search space of this joint optimization is enormous. Existing work sidesteps the challenge by making unrealistic assumptions about reconfiguration cost, under which reconfiguring is either never or always worthwhile. In this paper, we show that any candidate sequence of topologies and flow schedules can be expressed as a sum of adjacency matrices and their powers. This abstraction captures the entire solution space and yields a lower bound on all-to-all completion time. Building on this formulation, we identify a family of topology sequences with strong symmetry and high expansion that admits bandwidth-efficient schedules, which our algorithm constructs with low computational overhead. Together, these insights allow us to efficiently construct near-optimal solutions without enumerating the combinatorial design space. Evaluation shows that our approach reduces all-to-all completion time by up to 44% on average across a wide range of network parameters, message sizes, and workload types.
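As a rough illustration of why adjacency matrices and their powers are a natural language for this problem, the following sketch counts multi-hop reachability on toy topologies. It is not the paper's formulation or algorithm: the circulant topologies, the helper names (`circulant`, `hops_to_cover`, `sequence_covers`), and the "hold or forward one hop" model of a reconfiguration step are all assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(a, b):
    """Entrywise sum of two square matrices."""
    n = len(a)
    return [[a[i][j] + b[i][j] for j in range(n)] for i in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def circulant(n, offsets):
    """Hypothetical toy topology: node i has a directed link to (i + d) % n."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for d in offsets:
            adj[i][(i + d) % n] = 1
    return adj


def covers_all_pairs(m):
    """True if every ordered pair of distinct nodes has a positive entry."""
    n = len(m)
    return all(m[i][j] > 0 for i in range(n) for j in range(n) if i != j)


def hops_to_cover(adj):
    """Smallest k such that A + A^2 + ... + A^k reaches every pair (static topology)."""
    n = len(adj)
    power, total = adj, [row[:] for row in adj]
    for k in range(1, n + 1):
        if covers_all_pairs(total):
            return k
        power = matmul(power, adj)   # A^(k+1): walks of exactly k+1 hops
        total = add(total, power)    # walks of at most k+1 hops
    return None


def sequence_covers(adjs):
    """One hop per topology in the sequence; a message may also hold in place."""
    n = len(adjs[0])
    reach = identity(n)
    for adj in adjs:
        reach = matmul(reach, add(identity(n), adj))
    return covers_all_pairs(reach)


static = circulant(9, [1, 2])        # fixed topology, out-degree 2
reconfigured = circulant(9, [3, 6])  # second topology after one reconfiguration
```

On these nine-node examples, `hops_to_cover(static)` evaluates to 4, while `sequence_covers([static, reconfigured])` is true after only two hops. This is the trade-off described above in miniature: a reconfiguration shortens paths, and whether it pays off depends on how its cost compares with the transmission time saved.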