To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training. This turns each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without a stop-and-restart cycle on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14$\times$ to 23$\times$ over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.
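To make the idea of reshaping state online across parallel dimensions more concrete, the following is a minimal sketch of a transfer plan for a single tensor dimension that is evenly split across ranks. It is an illustration under stated assumptions, not LiveR's implementation: the names `shard_bounds` and `reshard_plan` are hypothetical, the split is assumed to be even and contiguous, and only one dimension is considered. The sketch computes which index ranges each source rank would send to each target rank when the degree of parallelism changes.

```python
from typing import List, Tuple

# Hypothetical helper names; not part of LiveR, Megatron-LM, or PyTorch.

def shard_bounds(length: int, parts: int, rank: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) owned by `rank` under an even split."""
    if parts <= 0 or length % parts != 0:
        raise ValueError("this sketch assumes an even split")
    if not 0 <= rank < parts:
        raise ValueError("rank out of range")
    size = length // parts
    return rank * size, (rank + 1) * size


def reshard_plan(length: int, src_parts: int, dst_parts: int) -> List[Tuple[int, int, int, int]]:
    """List the (src_rank, dst_rank, start, end) slices needed to move from
    a `src_parts`-way split to a `dst_parts`-way split of the same dimension."""
    plan = []
    for dst in range(dst_parts):
        d0, d1 = shard_bounds(length, dst_parts, dst)
        for src in range(src_parts):
            s0, s1 = shard_bounds(length, src_parts, src)
            lo, hi = max(d0, s0), min(d1, s1)  # overlap of the two ranges
            if lo < hi:
                plan.append((src, dst, lo, hi))
    return plan


if __name__ == "__main__":
    # Growing a dimension of size 8 from a 2-way to a 4-way split.
    for src, dst, lo, hi in reshard_plan(8, 2, 4):
        print(f"src rank {src} -> dst rank {dst}: indices [{lo}, {hi})")
```

Each tuple in the plan corresponds to one point-to-point slice transfer, so a plan of this kind lets state move directly between workers instead of passing through a checkpoint on storage. A real system would also need to handle uneven splits, pipeline-stage reassignment, and optimizer state, none of which this sketch covers.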