Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, and Dragonfly topologies. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on the deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze its performance. We demonstrate techniques for simple cabling and cabling validation, as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, and linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate the deployment of SF, and the associated open-source routing architecture is fully portable and can be applied to accelerate any low-diameter interconnect.