Geo-distributed ML training can benefit many emerging ML scenarios (e.g., large model training, federated learning) by using multi-regional cloud resources connected over wide-area networks (WANs). However, its efficiency is limited by two challenges. First, efficient elastic scheduling of multi-regional cloud resources is usually missing, which hurts both resource utilization and training performance. Second, training communication over the WAN remains the main overhead and is vulnerable to the WAN's low bandwidth and high fluctuation. In this paper, we propose a framework, Cloudless-Training, that realizes efficient parameter server (PS)-based geo-distributed ML training in three aspects. First, it uses a two-layer architecture with a control plane and a physical training plane to support elastic scheduling and communication across multi-regional clouds in a serverless manner. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of pre-existing training datasets. Third, it provides two new synchronization strategies for training partitions among clouds: asynchronous SGD with gradient accumulation (ASGD-GA) and inter-PS model averaging (MA). Cloudless-Training is implemented with OpenFaaS and evaluated on Tencent Cloud. Experiments show that it can support general ML training in a geo-distributed way and greatly improve resource utilization (e.g., a 9.2%-24.0% reduction in training cost) and synchronization efficiency (e.g., up to a 1.7x training speedup over the baseline), with model correctness guarantees.
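The abstract names the two synchronization strategies but does not specify them, so the following is only a minimal sketch of the general ideas, not the Cloudless-Training implementation. It assumes that gradient accumulation means summing local gradients over a fixed number of steps before one transfer across the WAN, and that model averaging means an element-wise mean of the models held by the parameter servers. The names `GradientAccumulator`, `period`, and `average_models` are hypothetical.

```python
from typing import List, Optional

Vector = List[float]


class GradientAccumulator:
    """Sums local gradients and releases the sum every `period` steps.

    Assumption: one release corresponds to one cross-cloud (WAN) transfer,
    so a larger `period` means fewer transfers.
    """

    def __init__(self, dim: int, period: int) -> None:
        if period < 1:
            raise ValueError("period must be at least 1")
        self.period = period
        self.buffer: Vector = [0.0] * dim
        self.count = 0

    def add(self, grad: Vector) -> Optional[Vector]:
        """Add one local gradient; return the accumulated sum when due, else None."""
        self.buffer = [b + g for b, g in zip(self.buffer, grad)]
        self.count += 1
        if self.count < self.period:
            return None
        released = self.buffer
        # Reset the buffer for the next accumulation window.
        self.buffer = [0.0] * len(released)
        self.count = 0
        return released


def average_models(models: List[Vector]) -> Vector:
    """Element-wise mean of the model vectors held by several parameter servers."""
    if not models:
        raise ValueError("at least one model is required")
    n = len(models)
    return [sum(values) / n for values in zip(*models)]
```

With `period=2`, the accumulator returns `None` on the first call to `add` and the summed gradient on the second, which halves the number of cross-cloud transfers in this toy setting. `average_models` would be called whenever the parameter servers exchange their models.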