Datacenter network design plays a critical role in AI training by enabling scaling to thousands of accelerators. Designing a near-optimal, throughput-oriented network (topology, routing, and collectives) remains an open problem: it has not been achieved at scale with broad applicability to physical and implementation constraints. We address this problem through a compelling use case, Google's TPU v4/5p supercomputer, where the topology can be reconfigured to achieve higher all-to-all throughput in support of large, parallelized AI training. We show that the existing TPU networks leave terabytes per second of throughput on the table, and we close that gap. This paper presents Throughput Optimized Networks at Scale (TONS), an automated network synthesis framework that meets the high-throughput demands of modern computing. TONS formulates topology synthesis as a linear optimization problem that maximizes a throughput-centric proxy metric, using theory and heuristics to scale to thousands of nodes. We further introduce a deadlock-free routing scheme compatible with limited virtual channels and optical switch faults, enabling the synthesized topologies to realize their predicted throughput gains in simulation. Evaluated on uniform random and all-to-all traffic, TONS networks achieve geometric mean speedups of 2.1x and 1.6x, respectively, over the best TPU v4/5p torus variants.
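To give a feel for what a throughput-centric proxy can look like on a torus baseline, the following is a minimal sketch, not the metric TONS optimizes (the abstract does not specify it). It assumes the standard capacity argument for uniform random traffic: if every node injects at rate θ and all links have equal capacity, total link load is θ · N · (average hop count), which cannot exceed total link capacity. The function names, the unit link capacity, and the single-link-per-neighbor wiring are all illustrative assumptions.

```python
from collections import deque


def torus_neighbors(node, dims):
    """Return the set of distinct neighbors of `node` in a torus of shape `dims`."""
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(node)
            nxt[axis] = (nxt[axis] + step) % size
            nxt = tuple(nxt)
            if nxt != node:  # a size-1 dimension has no real neighbor
                result.add(nxt)
    return result


def average_hops(dims):
    """Average shortest-path hop count from one node to every other node.

    A torus is vertex-transitive, so a single BFS source is representative.
    """
    src = tuple(0 for _ in dims)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nb in torus_neighbors(cur, dims):
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return sum(dist.values()) / (len(dist) - 1)


def uniform_throughput_bound(dims, link_capacity=1.0):
    """Upper bound on per-node injection rate under uniform random traffic.

    Assumes one directed link of capacity `link_capacity` per neighbor
    (hypothetical; real systems may wire several links per neighbor).
    """
    src = tuple(0 for _ in dims)
    degree = len(torus_neighbors(src, dims))
    # theta * N * avg_hops <= N * degree * capacity
    return degree * link_capacity / average_hops(dims)


if __name__ == "__main__":
    for shape in [(4, 4, 4), (8, 8, 8)]:
        print(shape, average_hops(shape), uniform_throughput_bound(shape))
```

Under these assumptions, lowering the average hop count at fixed node degree raises the bound, which is the intuition behind reconfiguring a topology for higher all-to-all throughput.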