Recently, large models have achieved state-of-the-art performance in various fields. Training such models requires distributed training techniques. However, finding an efficient distributed execution plan requires fine-grained model statistics, such as the memory and computing overhead of each operator, and is a labor-intensive task even for an expert in distributed training. In this paper, we introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization. To profile operator costs, existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor, approaches that are often time-consuming and misleading. In contrast, MAP provides an easy-to-use symbolic profiler that generates memory and computing statistics for an arbitrary PyTorch model at negligible time cost, which improves productivity for ML developers. In addition, MAP can seamlessly speed up different static planning tasks on PyTorch computation graphs, and it requires only a few lines of modification to user code to generate a new module instance with a top-performing distributed execution plan. The source code is publicly available at https://github.com/hpcaitech/ColossalAI
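To make the idea of symbolic profiling concrete, the following is a minimal sketch in plain Python, not MAP's actual implementation or API. It propagates only shapes and element sizes through a small multilayer perceptron, so memory and FLOP estimates are obtained without allocating tensors or running any kernel. All names here (`SymTensor`, `OpStats`, `profile_linear`, `profile_relu`, `profile_mlp`) are hypothetical, and the cost formulas assume float32 data, two FLOPs per multiply-accumulate, and one FLOP per element for ReLU.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class SymTensor:
    """A symbolic tensor: shape and element size only, no storage."""
    shape: tuple
    itemsize: int = 4  # bytes per element (assumption: float32)

    @property
    def nbytes(self) -> int:
        return prod(self.shape) * self.itemsize


@dataclass
class OpStats:
    """Per-operator statistics gathered without executing the operator."""
    name: str
    flops: int
    param_bytes: int
    activation_bytes: int


def profile_linear(name, x, out_features, bias=True):
    """Propagate a symbolic tensor through a dense layer and cost it."""
    in_features = x.shape[-1]
    rows = prod(x.shape[:-1])  # number of input vectors in the batch
    out = SymTensor(x.shape[:-1] + (out_features,), x.itemsize)
    # Assumption: one multiply and one add per weight element per row.
    flops = 2 * rows * in_features * out_features
    n_params = in_features * out_features + (out_features if bias else 0)
    return out, OpStats(name, flops, n_params * x.itemsize, out.nbytes)


def profile_relu(name, x):
    """ReLU keeps the shape; assume one FLOP per element, no parameters."""
    out = SymTensor(x.shape, x.itemsize)
    return out, OpStats(name, prod(x.shape), 0, out.nbytes)


def profile_mlp(batch, in_features, hidden, out_features):
    """Profile Linear -> ReLU -> Linear using shapes alone."""
    x = SymTensor((batch, in_features))
    stats = []
    x, s = profile_linear("fc1", x, hidden)
    stats.append(s)
    x, s = profile_relu("relu", x)
    stats.append(s)
    x, s = profile_linear("fc2", x, out_features)
    stats.append(s)
    return stats


if __name__ == "__main__":
    for s in profile_mlp(batch=8, in_features=16, hidden=32, out_features=4):
        print(s)
```

Because the cost of this kind of profiling depends on the number of operators rather than on tensor sizes, the same pass can be repeated cheaply for many candidate batch sizes or sharding choices, which is the property a static planner would rely on.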