Current techniques and systems for distributed model training mostly assume that clusters consist of homogeneous servers with constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times across all workers, and facilitates training on clusters composed of servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than a factor of 4 on heterogeneous clusters.
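The idea of equalizing iteration times with proportional control can be illustrated with a minimal sketch. This is not the controller from the paper, whose details are not given in the abstract; it is one plausible proportional-only update (no integral or derivative terms) under stated assumptions. The function name `rebalance_batches`, the `gain` and `deadband` parameters, and the choice of target time are all hypothetical and introduced here for illustration only.

```python
from typing import List


def rebalance_batches(
    batch_sizes: List[int],
    iter_times: List[float],
    gain: float = 1.0,       # hypothetical proportional gain
    deadband: float = 0.05,  # hypothetical tolerance; skips small corrections
    min_batch: int = 1,
) -> List[int]:
    """One proportional-control step toward equal iteration times.

    batch_sizes[i] is worker i's current mini-batch size and
    iter_times[i] is its measured iteration time in seconds.
    """
    if len(batch_sizes) != len(iter_times):
        raise ValueError("batch_sizes and iter_times must have equal length")
    if any(t <= 0 for t in iter_times):
        raise ValueError("iteration times must be positive")

    # Measured throughput of each worker, in samples per second.
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]

    # Assumed target: the iteration time at which all workers would finish
    # together while processing the same global batch.
    target_time = sum(batch_sizes) / sum(throughputs)

    new_sizes = []
    for b, t, x in zip(batch_sizes, iter_times, throughputs):
        error = target_time - t  # positive means the worker has slack
        if abs(error) / target_time <= deadband:
            # Close enough to the target: leave the batch size alone so the
            # controller does not oscillate on measurement noise.
            new_sizes.append(b)
            continue
        # Convert the time error into samples using the worker's throughput.
        delta = gain * x * error
        new_sizes.append(max(min_batch, round(b + delta)))
    return new_sizes


if __name__ == "__main__":
    # Two workers with equal batches; the second is twice as slow.
    print(rebalance_batches([64, 64], [1.0, 2.0]))
```

With `gain = 1.0` the update assigns each worker a batch proportional to its measured throughput, so the global batch size is preserved up to rounding. A smaller gain would approach that allocation over several iterations, which may be preferable when throughput measurements are noisy or resource availability changes over time.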