Distributed machine learning (DML) makes it possible to train large neural networks in a reasonable amount of time. However, because computing power has grown much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only further increases communication overhead but also creates unfairness and degrades the user experience. In this paper, we first analyse how network contention affects training time in a cluster with 32 NVIDIA V100 GPUs. We then propose vClos, which eliminates network contention by jointly optimizing the network topology and the communication pattern in distributed training. We also propose OCS-vClos, which introduces a layer of optical circuit switches (OCSs) into the leaf-spine network to reduce the network resource fragmentation that the resource allocation strategy in vClos may cause. Testbed experiments and large-scale simulations based on real traces demonstrate the superiority of vClos over existing network resource scheduling strategies.