Rail-optimized networks have become the de facto datacenter scale-out fabric for large-scale ML training. However, the use of high-radix electrical switches to provide all-to-all connectivity within rails imposes massive power and cost overheads. We propose rethinking the rail abstraction: retain its communication semantics, but realize it using optical circuit switches (OCSes). The key challenge is that an OCS supports only one-to-one connectivity at any given time, limiting the fan-out of traffic in ML workloads that use hybrid parallelisms. We overcome this through \emph{parallelism-driven rail reconfiguration}, which exploits the non-overlapping communication phases of different parallelism dimensions. This approach time-multiplexes a single set of physical ports across circuit configurations tailored to each phase within a training iteration. We design and implement Opus, a control plane that orchestrates this in-job reconfiguration of photonic rails at parallelism phase boundaries, and evaluate it on a physical OCS testbed, on the Perlmutter supercomputer, and in simulation at up to 2,048 GPUs. Our results show that photonic rails can achieve over $23\times$ network power reduction and $4\times$ cost savings while incurring less than $6\%$ training overhead at production-relevant OCS reconfiguration latencies.