Large-scale deep learning models deliver significant performance improvements on a variety of downstream tasks. Current data- and model-parallel approaches rely on model replication and partitioning to support distributed training of ultra-large models. However, deploying these systems directly often leads to sub-optimal training efficiency because of complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of data and model parallelism. Given the model description and the device information, OSDP trades off memory consumption against hardware utilization, automatically generating the distributed computation graph and maximizing overall system throughput. In addition, OSDP introduces operator splitting to further reduce the peak memory footprint during training with negligible overhead, which makes larger models trainable and yields higher throughput. Extensive experiments with OSDP on multiple kinds of large-scale models demonstrate that the proposed strategy outperforms the state of the art in multiple respects. Our code is available at https://github.com/Youhe-Jiang/OptimalShardedDataParallel.
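
To make the memory-versus-throughput trade-off concrete, the following is a minimal sketch and not OSDP's actual search algorithm or API. It assumes each operator can run in one of two modes: replicated ("dp", full memory, lower step time) or sharded ("sharded", memory divided across devices, higher step time from extra communication). The `Op` class, the `search_plan` function, and all cost numbers are hypothetical and chosen only for illustration. The search exhaustively picks the per-operator assignment with the lowest total time that fits within a per-device memory limit.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Op:
    """Hypothetical per-operator cost model (illustrative numbers only)."""
    name: str
    mem_gb: float          # memory per device when the operator is replicated
    time_dp_ms: int        # step time when replicated (plain data parallel)
    time_sharded_ms: int   # step time when sharded (extra communication)


def search_plan(ops, num_devices, mem_limit_gb):
    """Exhaustively choose "dp" or "sharded" per operator.

    Returns (total_time_ms, mem_gb, modes) for the fastest assignment that
    fits in mem_limit_gb, or None if even full sharding does not fit.
    Brute force is exponential in len(ops); it is used here for clarity.
    """
    best = None
    for modes in product(("dp", "sharded"), repeat=len(ops)):
        mem = sum(
            op.mem_gb if mode == "dp" else op.mem_gb / num_devices
            for op, mode in zip(ops, modes)
        )
        if mem > mem_limit_gb:
            continue  # this assignment exceeds the device memory budget
        total_time = sum(
            op.time_dp_ms if mode == "dp" else op.time_sharded_ms
            for op, mode in zip(ops, modes)
        )
        if best is None or total_time < best[0]:
            best = (total_time, mem, modes)
    return best


# Hypothetical three-operator model on 4 devices with an 8 GB budget.
OPS = [
    Op("embed", mem_gb=4.0, time_dp_ms=10, time_sharded_ms=15),
    Op("attn", mem_gb=2.0, time_dp_ms=20, time_sharded_ms=22),
    Op("mlp", mem_gb=6.0, time_dp_ms=30, time_sharded_ms=39),
]

if __name__ == "__main__":
    print(search_plan(OPS, num_devices=4, mem_limit_gb=8.0))
```

With these made-up costs, replicating every operator needs 12 GB and does not fit, so the search shards the operators whose sharding penalty is small and keeps the most communication-expensive one replicated. A real system would replace the brute-force loop with a scalable search and would measure or estimate the costs rather than hard-code them.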