AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

Distributed training techniques such as Parameter Server and AllReduce have been proposed to exploit increasingly large datasets and rich features. However, stragglers occur frequently in distributed training because of resource contention and hardware heterogeneity, and they significantly hamper training efficiency. Previous work addresses only some types of stragglers and cannot adapt to the variety of stragglers seen in practice. Moreover, handling all stragglers within one systematic framework is challenging because different stragglers require different data allocation and fault-tolerance mechanisms. This paper therefore proposes AntDT (Ant Distributed Training Framework), a unified distributed training framework that adaptively solves straggler problems. First, the framework consists of four components: the Stateful Dynamic Data Sharding service, the Monitor, the Controller, and the Agent. These components work together to distribute workloads efficiently and to provide a range of pre-defined straggler mitigation methods with fault tolerance, hiding the messy details of data allocation and fault handling. Second, the framework is highly flexible and allows straggler mitigation solutions to be customized to the specific circumstances of a cluster. Leveraging this flexibility, we introduce two straggler mitigation solutions, AntDT-ND for non-dedicated clusters and AntDT-DD for dedicated clusters, as practical examples that resolve various types of stragglers at Ant Group. Comprehensive experiments and industrial deployment statistics show that AntDT outperforms other state-of-the-art methods by more than 3x in training efficiency. In addition, in Alipay's homepage recommendation scenario, AntDT reduces the training duration of the ranking model from 27.8 hours to 5.4 hours.
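
The abstract does not spell out how the Stateful Dynamic Data Sharding service works internally. As a rough illustration of the general idea only, the sketch below assumes a pull-based design: data is cut into small shards, each worker fetches its next shard when it finishes the previous one, and the service records which shards are pending, in flight, or finished so that the shards of a failed worker can be requeued. All names here (`Shard`, `ShardService`, `fetch`, `ack`, `recover`, `simulate`) are hypothetical and are not AntDT's API.

```python
import collections
import dataclasses
import heapq


@dataclasses.dataclass(frozen=True, order=True)
class Shard:
    """A contiguous range of sample indices [start, end)."""
    start: int
    end: int


class ShardService:
    """Hypothetical in-memory shard dispatcher (an assumption, not AntDT's API)."""

    def __init__(self, num_samples, shard_size):
        # Pending shards, in dataset order.
        self.todo = collections.deque(
            Shard(s, min(s + shard_size, num_samples))
            for s in range(0, num_samples, shard_size)
        )
        self.doing = {}  # worker id -> shards handed out but not yet acknowledged
        self.done = []   # shards whose processing was acknowledged

    def fetch(self, worker):
        """Hand the next pending shard to `worker`, or None if none are pending."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing.setdefault(worker, []).append(shard)
        return shard

    def ack(self, worker, shard):
        """Mark `shard` as finished by `worker`."""
        self.doing[worker].remove(shard)
        self.done.append(shard)

    def recover(self, worker):
        """Requeue the in-flight shards of a failed worker so no data is skipped."""
        for shard in reversed(self.doing.pop(worker, [])):
            self.todo.appendleft(shard)


def simulate(speeds, num_samples=1000, shard_size=50):
    """Simulate workers with the given speeds (samples per time unit).

    Returns the number of samples each worker ended up processing.
    """
    service = ShardService(num_samples, shard_size)
    processed = {worker: 0 for worker in speeds}
    events = []  # min-heap of (finish_time, worker, shard)

    def start_next(worker, now):
        shard = service.fetch(worker)
        if shard is not None:
            duration = (shard.end - shard.start) / speeds[worker]
            heapq.heappush(events, (now + duration, worker, shard))

    for worker in speeds:
        start_next(worker, 0.0)
    while events:
        now, worker, shard = heapq.heappop(events)
        service.ack(worker, shard)
        processed[worker] += shard.end - shard.start
        start_next(worker, now)  # a worker that finishes early simply pulls more
    return processed


if __name__ == "__main__":
    print(simulate({"fast": 4.0, "slow": 1.0}))
```

In this toy simulation the faster worker pulls shards more often and therefore processes a larger share of the data, so the slow worker does not hold up the whole job the way it would under a fixed, even split. A real service would also need persistence, timeouts, and synchronization across processes, none of which is modeled here.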
