Rail-optimized networks have become the de facto scale-out fabric for large-scale ML training in datacenters. However, using high-radix electrical switches to provide all-to-all connectivity within rails imposes massive power and cost overheads. We rethink the rail abstraction, retaining its communication semantics but realizing it with optical circuit switches (OCS). The key challenge is that an optical circuit switch supports only one-to-one connectivity at any given time, which limits the traffic fan-out available to ML workloads that use hybrid parallelism. We overcome this through \emph{parallelism-driven rail reconfiguration}, which exploits the non-overlapping communication phases of different parallelism dimensions to time-multiplex a single set of physical ports across circuit configurations tailored to each phase within a training iteration. We design and implement Opus, a control plane that orchestrates this in-job reconfiguration of photonic rails at parallelism phase boundaries, and evaluate it on a physical OCS testbed, on the Perlmutter supercomputer, and in simulation at up to 2,048 GPUs. Our results show that photonic rails can achieve over $23\times$ network power reduction and $4\times$ cost savings while incurring less than $6\%$ training overhead at production-relevant OCS reconfiguration latencies.