Existing checkpointing approaches are ill-suited for distributed training, even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configuration of the training run, and are thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategies and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training, such as improved resilience to hardware failures through continued training on the remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation for each phase of the checkpointing life cycle: a distributed representation for saving and a consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter together with metadata for mapping parameter fragments onto the training ranks of an arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
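To make the save/load asymmetry concrete, the following is a minimal sketch, not the paper's implementation: it assumes each parameter is sharded into contiguous 1-D fragments ordered by rank, consolidates them into one parallelism-agnostic array per parameter on save, and re-slices that array for a different world size on load. The function names (save_universal, load_for_ranks) are hypothetical, not DeepSpeed's API.

```python
import numpy as np

def save_universal(shards):
    """Consolidate per-rank fragments (dict: name -> list of 1-D arrays,
    ordered by rank) into one full array per parameter."""
    return {name: np.concatenate(frags) for name, frags in shards.items()}

def load_for_ranks(universal, world_size):
    """Re-shard the consolidated checkpoint for a new parallelism degree by
    splitting each consolidated parameter into `world_size` contiguous pieces."""
    return {name: np.array_split(full, world_size)
            for name, full in universal.items()}

# Save from a hypothetical 4-way sharded run...
shards = {"w": [np.arange(i * 3, i * 3 + 3, dtype=np.float32) for i in range(4)]}
ucp = save_universal(shards)
# ...and resume on 2 ranks instead of 4.
new_shards = load_for_ranks(ucp, world_size=2)
assert np.array_equal(np.concatenate(new_shards["w"]), ucp["w"])
```

In the actual technique the mapping metadata is richer than contiguous splitting (it must cover fragments produced by tensor, pipeline, and optimizer-state sharding), but the life-cycle split is the same: distributed on save, consolidated in the universal format, re-mapped on load.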