Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, and Dragonfly topologies. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on the deployment, management, and operational aspects of our 200-server test cluster and carefully analyze its performance. We demonstrate techniques for simple cabling and cabling validation, as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, and linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate SF deployment, and the associated open-source routing architecture is fully portable and can be applied to accelerate any low-diameter interconnect.