Large language models (LLMs), represented by ChatGPT, have found wide application and achieved breakthroughs in many fields. This suggests that LLMs with hundreds of billions or even trillions of parameters will continue to transform our daily lives. However, training LLMs at this scale requires ever larger high-performance GPU clusters and uninterrupted training runs lasting months. Because hardware and software failures are inevitable in large clusters, sustaining a large-scale training run for more than a week has become extremely challenging. A significant amount of time is spent on checkpoint saving and recovery, task restarts and resubmissions, and task anomaly checks, which greatly reduces effective training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant system for large model training. It comprises three key components: the Training pipeline Automatic Fault Tolerance and Recovery Mechanism (TOL), the Training Task Multi-dimensional Metric Automatic Anomaly Detection System (TEE), and the Training Checkpoint Asynchronous Access Automatic Fault Tolerance and Recovery Technology (TCE). Our preliminary results indicate that TRANSOM significantly improves the efficiency of large-scale LLM training on clusters. For instance, the pre-training time for GPT-3 with 175B parameters is reduced by 28%, and checkpoint storage and recovery performance improves by a factor of 20.
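The abstract does not describe how TCE is implemented, so the following is only a generic sketch of the idea behind asynchronous checkpoint access, not TRANSOM's actual design. It assumes the training state is a picklable Python object; the class `AsyncCheckpointer`, its method names, and the `step_XXXXXXXX.ckpt` file layout are all hypothetical. The training loop blocks only for an in-memory copy of the state, while the slow write to storage happens on a background thread.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, persist it in the background."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._writer = None  # background thread for the in-flight write, if any

    def save(self, step, state):
        # Wait for the previous write so at most one write is in flight.
        self.wait()
        # Blocking part: copy the state so training can keep mutating the original.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(step, snapshot))
        self._writer.start()

    def _write(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step:08d}.ckpt")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: a crash mid-write never exposes a partial checkpoint.
        os.replace(tmp_path, final_path)

    def wait(self):
        """Block until the in-flight write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load_latest(self):
        """Return (step, state) for the newest complete checkpoint, or None."""
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".ckpt"))
        if not names:
            return None
        latest = names[-1]
        step = int(latest[len("step_"):-len(".ckpt")])
        with open(os.path.join(self.directory, latest), "rb") as f:
            return step, pickle.load(f)


if __name__ == "__main__":
    ckpt = AsyncCheckpointer(tempfile.mkdtemp())
    weights = {"layer0": [0.1, 0.2]}
    ckpt.save(100, weights)      # returns as soon as the in-memory copy is made
    weights["layer0"][0] = 0.5   # training continues while the write proceeds
    ckpt.wait()
    print(ckpt.load_latest())
```

Writing to a temporary file and renaming it means that recovery after a failure only ever sees complete checkpoints, which is the property a restart mechanism would rely on. A real system would also have to handle GPU-resident tensors, sharded state across ranks, and storage failures, none of which this sketch attempts.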