TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have had a profound impact on various fields. However, training LLMs at this parameter scale requires large high-performance GPU clusters and training periods lasting for months. Because hardware and software failures are inevitable in large-scale clusters, maintaining uninterrupted, long-duration training is extremely challenging. As a result, a substantial amount of training time is spent on checkpoint saving and loading, task rescheduling and restart, and manual anomaly checks, which greatly harms overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: Transom Operator and Launcher (TOL), an automatic fault tolerance and recovery mechanism for the training pipeline; Transom Eagle Eye (TEE), an automatic anomaly detection system based on multi-dimensional training task metrics; and Transom Checkpoint Engine (TCE), an asynchronous checkpoint access technology with automatic fault tolerance and recovery. TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. When TEE detects a training anomaly, it reports it to TOL, which automatically applies its fault tolerance strategy to evict the abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading provided by TCE greatly reduces the overhead of fault tolerance. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B is reduced by 28%, while checkpoint saving and loading performance improves by a factor of 20.
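To illustrate the idea behind asynchronous checkpoint saving, the following is a minimal Python sketch. It is not the TCE implementation: the class `AsyncCheckpointer`, its methods, and the `step_XXXXXXXX.ckpt` file naming are hypothetical names chosen for illustration, and `pickle` on a local directory stands in for whatever serialization and storage backend a real system would use. The sketch assumes the training state fits in host memory, so that the only blocking cost for the training loop is an in-memory copy while the slow write runs on a background thread.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, persist it in the background."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._worker = None

    def save(self, step, state):
        # Allow at most one write in flight: finish the previous one first.
        self.wait()
        # Blocking part: copy the state so training can keep mutating the original.
        snapshot = copy.deepcopy(state)
        self._worker = threading.Thread(target=self._write, args=(step, snapshot))
        self._worker.start()

    def _write(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step:08d}.ckpt")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename only after the file is fully written, so a crash mid-write
        # never leaves a truncated file under the final checkpoint name.
        os.replace(tmp_path, final_path)

    def wait(self):
        # Block until the pending background write (if any) has completed.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load_latest(self):
        # Zero-padded step numbers make lexicographic order match step order.
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".ckpt"))
        if not names:
            return None
        with open(os.path.join(self.directory, names[-1]), "rb") as f:
            return pickle.load(f)
```

In a training loop, one would call `save(step, state)` periodically, call `wait()` before shutdown, and call `load_latest()` after a restart to resume from the most recent complete checkpoint. Writing to a temporary file and renaming it is a design choice that keeps partially written files from being mistaken for valid checkpoints during recovery.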
