Large-model training relies on recomputation to alleviate memory pressure and on pipelining to exploit data, tensor, and device parallelism. Existing recomputation approaches can incur up to 40% overhead when training real-world models, e.g., a GPT model with 22B parameters, because they perform recomputation on demand on the critical training path. In this paper, we design a new recomputation framework, Lynx, that reduces this overhead by overlapping recomputation with the communication that occurs in training pipelines. Lynx consists of an optimal scheduling algorithm (OPT) and a heuristic-based scheduling algorithm (HEU). OPT achieves a global optimum but suffers from a long search time. HEU is designed based on our observation that large DNN models contain identical structures, so the same scheduling policy can be applied to all of them. HEU achieves a local optimum but reduces the search time by 99% compared to OPT. Our comprehensive evaluation using GPT models with 1.3B-20B parameters shows that both OPT and HEU outperform state-of-the-art recomputation approaches (e.g., Megatron-LM and Checkmate) by 1.02-1.53x. HEU achieves performance similar to OPT with an average search time of 0.16s.