Existing checkpointing approaches are ill-suited for distributed training, even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configuration of the training run, and are thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategies and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training, such as improved resilience to hardware failures through continued training on the remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation for each phase of the checkpointing life cycle: a distributed representation for saving and a consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter together with metadata for mapping parameter fragments onto the training ranks of an arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
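To make the save/load asymmetry concrete, the following is a minimal sketch, not the paper's implementation: it assumes each parameter is sharded into contiguous 1-D fragments ordered by rank, consolidates them into one parallelism-agnostic array per parameter on save, and re-slices that array for a different world size on load. The function names (save_universal, load_for_ranks) are hypothetical, not DeepSpeed's API.

```python
import numpy as np

def save_universal(shards):
    """Consolidate per-rank fragments (dict: name -> list of 1-D arrays,
    ordered by rank) into one full array per parameter."""
    return {name: np.concatenate(frags) for name, frags in shards.items()}

def load_for_ranks(universal, world_size):
    """Re-shard the consolidated checkpoint for a new parallelism degree by
    splitting each consolidated parameter into `world_size` contiguous pieces."""
    return {name: np.array_split(full, world_size)
            for name, full in universal.items()}

# Save from a hypothetical 4-way sharded run...
shards = {"w": [np.arange(i * 3, i * 3 + 3, dtype=np.float32) for i in range(4)]}
ucp = save_universal(shards)
# ...and resume on 2 ranks instead of 4.
new_shards = load_for_ranks(ucp, world_size=2)
assert np.array_equal(np.concatenate(new_shards["w"]), ucp["w"])
```

In the actual technique the mapping metadata is richer than contiguous splitting (it must cover fragments produced by tensor, pipeline, and optimizer-state sharding), but the life-cycle split is the same: distributed on save, consolidated in the universal format, re-mapped on load.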