LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training

To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14$\times$-23$\times$ over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.
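As an illustration of what "reshaping state online across the tensor parallel dimension" can involve, the sketch below computes which row ranges each target rank must pull from each source rank when the tensor-parallel degree changes. This is not LiveR's algorithm. It is a minimal sketch that assumes an even split of a single dimension, and the helper names `shard_range`, `transfer_plan`, and `reshard` are hypothetical. Plain Python lists stand in for GPU tensors.

```python
from typing import List, Tuple


def shard_range(dim: int, world: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, end) row range owned by `rank` under an even split (assumed layout)."""
    assert dim % world == 0, "this sketch assumes the dimension divides evenly"
    size = dim // world
    return rank * size, (rank + 1) * size


def transfer_plan(dim: int, src_tp: int, dst_tp: int) -> List[Tuple[int, int, int, int]]:
    """Return (src_rank, dst_rank, start, end) row ranges that must move between worlds."""
    plan = []
    for dst in range(dst_tp):
        d0, d1 = shard_range(dim, dst_tp, dst)
        for src in range(src_tp):
            s0, s1 = shard_range(dim, src_tp, src)
            lo, hi = max(d0, s0), min(d1, s1)  # overlap of the two ownership ranges
            if lo < hi:
                plan.append((src, dst, lo, hi))
    return plan


def reshard(shards: List[list], dst_tp: int) -> List[list]:
    """Apply the plan to in-memory shards; a real system would stream each range over the interconnect."""
    src_tp = len(shards)
    dim = sum(len(s) for s in shards)
    out: List[list] = [[] for _ in range(dst_tp)]
    for src, dst, lo, hi in transfer_plan(dim, src_tp, dst_tp):
        s0, _ = shard_range(dim, src_tp, src)
        out[dst].extend(shards[src][lo - s0:hi - s0])  # global rows -> local offsets
    return out


if __name__ == "__main__":
    # Grow from TP=2 to TP=4 on an 8-row parameter.
    print(transfer_plan(8, 2, 4))
    print(reshard([[0, 1, 2, 3], [4, 5, 6, 7]], 4))
```

Because each transfer is described as an independent range, a plan of this shape can in principle be executed piece by piece, which is one way a handoff could keep its memory footprint bounded. Whether LiveR structures its transfers this way is not stated in the abstract.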

