All is Not Lost: LLM Recovery without Checkpoints

Training LLMs on decentralized nodes or spot instances lowers training cost and enables model democratization. The inevitable challenge in this setting is node churn caused by failures and by the operator's scheduling policies, which leads to the loss of parts of the model (some layers). The conventional approaches to recovering from failures are either checkpointing, where a copy of the entire model is periodically sent to additional storage, or redundant computation. These approaches incur significant communication and/or computation overhead even when no failure occurs, and they scale poorly to large models. In this paper, we propose CheckFree, an efficient recovery method in which a failed stage is replaced by a weighted average of its closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because it averages neighboring stages, it can only recover from failures of intermediate stages. We further extend our method to CheckFree+, which uses out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, the behavior of the first and last stages is mimicked by their neighboring stages, which allows CheckFree+ to recover them by copying those neighbors. To recover the (de-)embedding layers, CheckFree+ keeps copies of those layers in the neighboring stages, which requires a relatively small storage overhead. We extensively evaluate our method on LLaMa models ranging from 124M to 1.5B parameters under varying failure frequencies. At low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time, achieving up to a 12% improvement over redundant computation. Both of our proposals can be run with our code, available at: https://github.com/gensyn-ai/CheckFree
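
The recovery rule described in the abstract can be illustrated with a minimal sketch. This is not the code from the CheckFree repository: the function name `recover_stage`, the toy `Stage` representation, and the weights `w_prev` and `w_next` are hypothetical, and the abstract does not specify how the averaging weights are chosen. The sketch also assumes all stages have identically shaped parameters, so they can be averaged element by element.

```python
from typing import Dict, List

# Toy representation (an assumption): parameter name -> flattened values.
Stage = Dict[str, List[float]]


def recover_stage(
    stages: List[Stage],
    failed: int,
    w_prev: float = 0.5,  # hypothetical weight for the previous stage
    w_next: float = 0.5,  # hypothetical weight for the next stage
) -> Stage:
    """Rebuild the parameters of stages[failed] from its neighbors."""
    if len(stages) < 2:
        raise ValueError("need at least one surviving neighbor")
    last = len(stages) - 1

    # Boundary stages have a single neighbor, so the sketch copies it,
    # in the spirit of CheckFree+ as described in the abstract.
    if failed == 0:
        return {name: list(vals) for name, vals in stages[1].items()}
    if failed == last:
        return {name: list(vals) for name, vals in stages[last - 1].items()}

    # Intermediate stage: normalized weighted average of both neighbors.
    prev_stage, next_stage = stages[failed - 1], stages[failed + 1]
    total = w_prev + w_next
    return {
        name: [
            (w_prev * p + w_next * n) / total
            for p, n in zip(prev_stage[name], next_stage[name])
        ]
        for name in prev_stage
    }


if __name__ == "__main__":
    pipeline = [{"w": [0.0, 2.0]}, {"w": [9.0, 9.0]}, {"w": [4.0, 6.0]}]
    # Stage 1 is lost; its old values are ignored during recovery.
    print(recover_stage(pipeline, failed=1))
```

Nothing has to be stored ahead of time for the intermediate case, since the replacement is built entirely from parameters that the surviving neighbors already hold. The recovered stage is only an approximation of the lost one, and training continues from there.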
