Large-model training relies on recomputation to alleviate memory pressure and on pipelining to exploit data, tensor, and device parallelism. Existing recomputation approaches can incur up to 40% overhead when training real-world models, e.g., a GPT model with 22B parameters, because they perform recomputation on demand on the critical training path. In this paper, we design a new recomputation framework, Lynx, that reduces this overhead by overlapping recomputation with the communication that occurs in training pipelines. Lynx consists of an optimal scheduling algorithm (OPT) and a heuristic-based scheduling algorithm (HEU). OPT achieves a global optimum but suffers from a long search time. HEU is designed based on our observation that large DNN models contain identical structures, so the same scheduling policy can be applied to all of them. HEU achieves a local optimum but reduces the search time by 99% compared to OPT. Our comprehensive evaluation using GPT models with 1.3B-20B parameters shows that both OPT and HEU outperform state-of-the-art recomputation approaches (e.g., Megatron-LM and Checkmate) by 1.02-1.53x. HEU achieves performance similar to OPT with an average search time of 0.16s.