Scheduling Parallel Optical Circuit Switches for AI Training

The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, motivating renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple OCSes in parallel is a leading approach. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes so as to minimize the makespan under a reconfiguration delay $\delta$. Our algorithm, Spectra, relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across the parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (a GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
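To make the Schedule step and the makespan objective concrete, here is a minimal sketch in Python. It is not Spectra's actual procedure, which the abstract does not spell out. It assumes the Decompose step has already produced a list of permutation weights, that each permutation placed on a switch costs its weight plus one reconfiguration delay, and that "load-aware assignment" can be approximated by a longest-first greedy rule that always picks the least-loaded switch. The function name `schedule_lpt` and the sample weights are hypothetical.

```python
import heapq


def schedule_lpt(weights, s, delta):
    """Greedy load-aware assignment (an assumed stand-in, not Spectra itself).

    weights: durations of the weighted permutations from a decomposition of D
    s:       number of parallel switches
    delta:   reconfiguration delay charged once per permutation (assumption)

    Returns (makespan, assignment), where assignment[k] lists the indices
    of the permutations placed on switch k.
    """
    heap = [(0.0, k) for k in range(s)]  # (current load, switch index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(s)]

    # Place the heaviest permutations first, each on the least-loaded switch.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order:
        load, k = heapq.heappop(heap)
        assignment[k].append(i)
        heapq.heappush(heap, (load + weights[i] + delta, k))

    # The makespan is the load of the busiest switch.
    makespan = max(load for load, _ in heap)
    return makespan, assignment


# Hypothetical permutation weights, two switches, delay 0.5.
weights = [5.0, 4.0, 3.0, 3.0, 1.0]
makespan, assignment = schedule_lpt(weights, s=2, delta=0.5)
print(makespan, assignment)
```

In this example the two switches end with loads 9.0 and 9.5, so the makespan is 9.5. A residual imbalance of this kind is what the Equalize step targets by splitting permutations across switches, which this sketch does not attempt.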

