Deep learning is witnessing a surge in large-scale models. Training such models is costly, which motivates training them on commodity servers that more researchers can access. The massive number of parameters necessitates model-parallel training methods. Existing studies focus on training with pipeline model parallelism. However, tensor model parallelism (TMP) becomes inevitable as the model size keeps increasing, yet its frequent data-dependent communication and computation operations significantly reduce training efficiency. In this paper, we present Oases, an automated TMP method with overlapped communication to accelerate large-scale model training on commodity servers. Oases introduces a fine-grained schedule of training operations to maximize the overlap between communication and computation that have data dependences. Additionally, we design the Oases planner, which searches for the best TMP parameter partition strategy to achieve further acceleration. Unlike existing methods, the Oases planner explicitly models the cost of overlapped communication and computation operations. We evaluate Oases on various model settings and two commodity clusters, and compare it with four state-of-the-art implementations. Experimental results show that Oases achieves speedups of 1.01--1.48\(\times\) over the fastest baseline, and speedups of up to 1.95\(\times\) over Megatron.