The training scale of large language models (LLMs) has reached tens of thousands of GPUs and continues to expand, enabling larger models to be trained faster. Accompanying this expansion in resource scale is the growing prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the unique characteristics of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization through an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.
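The abstract does not spell out how ETTR is computed; a common definition in the LLM training reliability literature, assumed here, is the fraction of a job's wall-clock time spent on productive training:

\[
\mathrm{ETTR} = \frac{T_{\text{productive}}}{T_{\text{wall-clock}}}
\]

Under this reading, a 97% ETTR means at most 3% of the three-month run was lost to failure detection, diagnosis, and recovery overheads.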