LLMs have seen rapid adoption across domains. They must be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, we study in this paper how to reduce I/O overheads to enable fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significantly impacting the training process. Specifically, we introduce a lazy asynchronous multi-level approach that exploits the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference to the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.
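To make the idea concrete, the following is a minimal, framework-agnostic sketch of asynchronous multi-level checkpointing: the training thread takes a cheap in-memory snapshot while the state is stable (level 1), and a background thread lazily flushes snapshots to persistent storage (level 2), overlapping I/O with subsequent iterations. The `AsyncCheckpointer` class and its API are illustrative assumptions, not the paper's actual implementation; plain Python lists stand in for tensor shards.

```python
import copy
import os
import pickle
import queue
import tempfile
import threading

class AsyncCheckpointer:
    """Illustrative two-level asynchronous checkpointer (not the paper's code).

    Level 1: snapshot the state into host memory on the training thread;
             this is the only step that blocks training.
    Level 2: a background thread lazily persists snapshots to storage,
             overlapping I/O with subsequent training iterations.
    """

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, step, state):
        # Level 1: cheap copy while tensors are stable between updates.
        snapshot = copy.deepcopy(state)
        self._q.put((step, snapshot))  # returns immediately; training resumes

    def _flush_loop(self):
        # Level 2: persist queued snapshots in the background.
        while True:
            step, snapshot = self._q.get()
            path = os.path.join(self.out_dir, f"ckpt_{step}.pkl")
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)
            self._q.task_done()

    def wait(self):
        # Block until all pending snapshots reach storage (e.g., at shutdown).
        self._q.join()

# Toy usage: three "training" iterations, each followed by a checkpoint.
ckpt = AsyncCheckpointer(tempfile.mkdtemp())
state = {"weights": [0.0] * 4}
for step in range(3):
    state["weights"] = [w + 1.0 for w in state["weights"]]  # mutate in place
    ckpt.checkpoint(step, state)  # deepcopy shields the snapshot from mutation
ckpt.wait()
```

Because `checkpoint()` only pays for the in-memory copy, the expensive write to the parallel file system never sits on the training critical path; the deep copy plays the role of the immutability window the paper exploits.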