Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures that incur significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex failure scenarios that arise: they focus narrowly on eliminating downtime for individual tasks without considering the overall cost impact on the cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.