LLMs have seen rapid adoption across domains. They must be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, we study in this paper how to reduce I/O overheads to enable fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significantly impacting the training process. Specifically, we introduce a lazy asynchronous multi-level approach that exploits the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference to the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.
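To make the idea concrete, the following is a minimal, framework-agnostic sketch of asynchronous multi-level checkpointing: the training thread takes a cheap in-memory snapshot while the state is stable (level 1), and a background thread lazily flushes snapshots to persistent storage (level 2), overlapping I/O with subsequent iterations. The `AsyncCheckpointer` class and its API are illustrative assumptions, not the paper's actual implementation; plain Python lists stand in for tensor shards.

```python
import copy
import os
import pickle
import queue
import tempfile
import threading

class AsyncCheckpointer:
    """Illustrative two-level asynchronous checkpointer (not the paper's code).

    Level 1: snapshot the state into host memory on the training thread;
             this is the only step that blocks training.
    Level 2: a background thread lazily persists snapshots to storage,
             overlapping I/O with subsequent training iterations.
    """

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, step, state):
        # Level 1: cheap copy while tensors are stable between updates.
        snapshot = copy.deepcopy(state)
        self._q.put((step, snapshot))  # returns immediately; training resumes

    def _flush_loop(self):
        # Level 2: persist queued snapshots in the background.
        while True:
            step, snapshot = self._q.get()
            path = os.path.join(self.out_dir, f"ckpt_{step}.pkl")
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)
            self._q.task_done()

    def wait(self):
        # Block until all pending snapshots reach storage (e.g., at shutdown).
        self._q.join()

# Toy usage: three "training" iterations, each followed by a checkpoint.
ckpt = AsyncCheckpointer(tempfile.mkdtemp())
state = {"weights": [0.0] * 4}
for step in range(3):
    state["weights"] = [w + 1.0 for w in state["weights"]]  # mutate in place
    ckpt.checkpoint(step, state)  # deepcopy shields the snapshot from mutation
ckpt.wait()
```

Because `checkpoint()` only pays for the in-memory copy, the expensive write to the parallel file system never sits on the training critical path; the deep copy plays the role of the immutability window the paper exploits.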