Harvest: Adaptive Photonic Switching Schedules for Collective Communication in Scale-up Domains

As chip-to-chip silicon photonics gains traction for its bandwidth and energy efficiency, its circuit-switched nature raises a fundamental question for collective communication: when and how should the interconnect be reconfigured to realize these benefits? Establishing direct optical paths can reduce congestion and propagation delay, but each reconfiguration incurs non-negligible overhead, making naive per-step reconfiguration impractical. We present Harvest, a systematic approach for synthesizing topology reconfiguration schedules that minimize collective completion time in photonic interconnects. Given a collective communication algorithm and its fixed communication schedule, Harvest determines how the interconnect should evolve over the course of the collective, explicitly balancing reconfiguration delay against congestion and propagation delay. We reduce the synthesis problem to a dynamic program with an underlying topology optimization subproblem and show that the approach applies to arbitrary collective communication algorithms. Furthermore, we exploit the algorithmic structure of a well-known AllReduce algorithm (Recursive Doubling) to synthesize optimal reconfiguration schedules without using any optimizers. Because the formulation is parameterized by reconfiguration delay, Harvest naturally adapts to different photonic technologies. Using packet-level and flow-level evaluations, as well as hardware emulation on commercial GPUs, we show that the schedules synthesized by Harvest significantly reduce collective completion time across multiple collective algorithms compared to static interconnects and reconfigure-every-step baselines.
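
To make the trade-off between reconfiguration delay and congestion concrete, the following is a minimal sketch of one way such a dynamic program could be structured. It is not the formulation from the paper. It assumes the collective's steps are split into contiguous segments, each served by a single topology, with one reconfiguration delay paid between consecutive segments. The names `plan_reconfigurations`, `segment_cost`, and `toy_cost` are hypothetical, and `segment_cost` stands in for the topology optimization subproblem, which is not modeled here.

```python
from typing import Callable, List, Tuple


def plan_reconfigurations(
    num_steps: int,
    segment_cost: Callable[[int, int], float],
    reconfig_delay: float,
) -> Tuple[float, List[int]]:
    """Split steps 0..num_steps-1 into contiguous segments, one topology each.

    segment_cost(i, j) is an assumed oracle: the time to run steps i..j-1 on
    the best single topology for that segment. Returns the total completion
    time and the first step index of every segment.
    """
    inf = float("inf")
    best = [inf] * (num_steps + 1)  # best[j]: minimum time to finish steps 0..j-1
    best[0] = 0.0
    prev = [0] * (num_steps + 1)    # prev[j]: start of the last segment ending at j

    for j in range(1, num_steps + 1):
        for i in range(j):
            # Assumption: the first segment's topology is set up for free;
            # every later segment pays one reconfiguration delay.
            penalty = reconfig_delay if i > 0 else 0.0
            candidate = best[i] + penalty + segment_cost(i, j)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the back-pointers to recover where each segment starts.
    starts: List[int] = []
    j = num_steps
    while j > 0:
        starts.append(prev[j])
        j = prev[j]
    starts.reverse()
    return best[num_steps], starts


def toy_cost(i: int, j: int) -> float:
    """Toy congestion model (an assumption, not measured data): a topology
    shared by k steps slows each of them by a factor of k, so the segment
    costs k * k time units."""
    k = j - i
    return float(k * k)


if __name__ == "__main__":
    for delay in (0.0, 10.0):
        total, starts = plan_reconfigurations(3, toy_cost, delay)
        print(f"delay={delay}: total={total}, segments start at {starts}")
```

Under this toy model, a negligible reconfiguration delay favors reconfiguring at every step, while a large delay favors a single static topology. Intermediate delays yield schedules between these two extremes, which is the regime the abstract targets.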
