Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) makes it possible to train large neural networks in a reasonable amount of time. However, because computing power grows much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also creates unfairness and degrades the user experience. In this paper, we first analyse how network contention affects training time in a cluster with 32 NVIDIA V100 GPUs. We then propose vClos, which eliminates network contention by jointly optimizing network topology and communication pattern in distributed training. We also propose OCS-vClos, which introduces a layer of optical circuit switches (OCSs) into the leaf-spine network to reduce the network resource fragmentation that the resource allocation strategy in vClos may cause. Testbed experiments and large-scale simulations based on real traces demonstrate the superiority of vClos over existing network resource scheduling strategies.
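To make the hash-collision problem concrete, the following is a minimal Python sketch, not the vClos algorithm itself. It assumes a single leaf switch that picks an uplink by hashing each flow's 5-tuple, and compares that with an isolated assignment that gives every flow its own uplink. The function names, the CRC32 hash, the addresses, and the flow and uplink counts are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import prod
import zlib


def ecmp_uplink(flow, num_uplinks):
    """Pick an uplink by hashing the flow's 5-tuple (illustrative hash only)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_uplinks


def max_link_load(assignment):
    """Number of flows on the busiest uplink; 1 means no contention."""
    return max(Counter(assignment).values())


def collision_probability(num_flows, num_uplinks):
    """Chance that uniform random hashing puts two or more flows on one uplink."""
    if num_flows > num_uplinks:
        return 1.0  # pigeonhole: some uplink must carry at least two flows
    no_collision = prod(
        (num_uplinks - i) / num_uplinks for i in range(num_flows)
    )
    return 1.0 - no_collision


# Hypothetical workload: 8 equal-rate flows leaving a leaf switch with 8 uplinks.
NUM_UPLINKS = 8
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 50000 + i, 4791, "UDP") for i in range(8)
]

hashed = [ecmp_uplink(flow, NUM_UPLINKS) for flow in flows]
isolated = list(range(len(flows)))  # one dedicated uplink per flow

print("busiest uplink under hashing:", max_link_load(hashed))
print("busiest uplink when isolated:", max_link_load(isolated))
print("P(collision) for 8 flows on 8 uplinks:",
      round(collision_probability(8, NUM_UPLINKS), 4))
```

Under the uniform-hashing assumption, a collision is close to certain even when there are as many uplinks as flows, which is why a scheduler that controls placement and routing can avoid contention that hash-based load balancing cannot.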


