Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over established topologies such as Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on the deployment, management, and operational aspects of our 200-server test cluster and carefully analyze its performance. We demonstrate techniques for simple cabling and cabling validation, as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, and linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF, and the associated open-source routing architecture is fully portable and can be applied to accelerate any low-diameter interconnect.