Large-scale deep learning models deliver significant performance improvements on a wide variety of downstream tasks. Existing data-parallel and model-parallel approaches rely on model replication and partitioning to support distributed training of ultra-large models. However, deploying these systems directly often yields sub-optimal training efficiency because of complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of data and model parallelism. Given a model description and device information, OSDP trades off memory consumption against hardware utilization, automatically generating a distributed computation graph that maximizes overall system throughput. In addition, OSDP introduces operator splitting to further reduce the peak memory footprint during training with negligible overhead, which makes larger models trainable and yields higher throughput. Extensive experiments with OSDP on several kinds of large-scale models demonstrate that the proposed strategy outperforms the state of the art in multiple respects.
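To make the memory versus throughput trade-off concrete, the following is a minimal sketch of one way such a plan search could look. It is not OSDP's actual algorithm or cost model: the `Op` fields, the assumption that sharding an operator divides its memory by the number of devices and adds a fixed communication time, and the brute-force search are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Op:
    """Hypothetical per-operator profile (all numbers are illustrative)."""
    name: str
    param_mem: float        # memory when fully replicated on each device (GB)
    compute_time: float     # per-step compute time (ms)
    shard_comm_time: float  # extra communication time if the operator is sharded (ms)


def plan_cost(ops, shard_flags, num_devices):
    """Return (per-device memory, per-step time) for one sharding plan.

    Assumed toy cost model: a sharded operator stores 1/num_devices of its
    memory but pays shard_comm_time; a replicated operator pays neither.
    """
    mem = 0.0
    time = 0.0
    for op, sharded in zip(ops, shard_flags):
        if sharded:
            mem += op.param_mem / num_devices
            time += op.compute_time + op.shard_comm_time
        else:
            mem += op.param_mem
            time += op.compute_time
    return mem, time


def search_plan(ops, num_devices, mem_limit):
    """Exhaustively pick the fastest plan that fits in mem_limit.

    Returns (shard_flags, time, mem), or None if no plan fits.
    Exponential in len(ops), so only suitable for small illustrations.
    """
    best = None
    for flags in product((False, True), repeat=len(ops)):
        mem, time = plan_cost(ops, flags, num_devices)
        if mem > mem_limit:
            continue  # plan violates the device memory constraint
        if best is None or time < best[1]:
            best = (flags, time, mem)
    return best


if __name__ == "__main__":
    example_ops = [
        Op("embed", param_mem=8.0, compute_time=10.0, shard_comm_time=4.0),
        Op("attn", param_mem=4.0, compute_time=20.0, shard_comm_time=3.0),
        Op("mlp", param_mem=6.0, compute_time=30.0, shard_comm_time=6.0),
    ]
    print(search_plan(example_ops, num_devices=4, mem_limit=10.0))
```

Under these made-up numbers, full replication needs 18 GB per device and does not fit in 10 GB, while sharding every operator fits but pays all of the communication cost. The search instead shards only `embed` and `attn`, which fits in 9 GB at 67 ms per step. This mixed plan illustrates why a per-operator choice can beat both pure data parallelism and uniform sharding.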