In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. However, training such models requires a range of techniques to cope with the limited computing power and memory of devices such as GPUs. Commonly used techniques include pipeline parallelism, tensor parallelism, and activation checkpointing. While existing works have focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint schedules (Herrmann et al. 2019, Beaumont et al. 2021), no method has been proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate estimation of memory and computing overhead, which is often time-consuming to obtain and can be misleading. Existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor. To address these challenges, we introduce a system that jointly optimizes distributed execution and gradient checkpointing plans. Additionally, we provide an easy-to-use symbolic profiler that generates memory and computing statistics for any PyTorch model at minimal time cost. Our approach allows users to parallelize their model training on the given hardware with minimal code changes. The source code is publicly available in the Colossal-AI GitHub repository at https://github.com/hpcaitech/ColossalAI