Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective in understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are the most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We introduce a taxonomy of failures and key reliability metrics, and analyze 11 months of data from two state-of-the-art ML environments, spanning over 150 million A100 GPU hours and 4 million jobs. Building on our data, we fit a failure model to project the Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, the Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms.
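To make the two projected quantities concrete, the following sketch shows how an MTTF projection and an ETTR estimate could be computed. It is not the paper's fitted model: it assumes independent, exponentially distributed per-GPU failures (so job MTTF shrinks as 1/N) and a simple first-order checkpoint/restart cost model; all function names and numeric inputs are hypothetical illustrations.

```python
# Illustrative sketch only (assumed model, not the paper's fitted one):
# project job-level MTTF across GPU scales and estimate the Effective
# Training Time Ratio (ETTR) under a checkpoint/restart overhead model.

def projected_mttf_hours(mttf_single_gpu_hours: float, n_gpus: int) -> float:
    """Assuming independent exponential failures, job MTTF scales as 1/N."""
    return mttf_single_gpu_hours / n_gpus

def estimated_ettr(mttf_job_hours: float,
                   checkpoint_interval_hours: float,
                   checkpoint_write_hours: float,
                   restart_hours: float) -> float:
    """ETTR: useful training time divided by wall-clock time.

    Steady-state overheads: each interval pays the checkpoint write cost,
    and each failure loses on average half an interval of work plus the
    restart cost.
    """
    failures_per_hour = 1.0 / mttf_job_hours
    overhead_per_hour = (
        checkpoint_write_hours / checkpoint_interval_hours      # checkpoint cost
        + failures_per_hour
        * (checkpoint_interval_hours / 2.0 + restart_hours)     # lost work + restart
    )
    return max(0.0, 1.0 - overhead_per_hour)

if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration only.
    mttf_job = projected_mttf_hours(mttf_single_gpu_hours=50_000.0, n_gpus=4096)
    ettr = estimated_ettr(mttf_job,
                          checkpoint_interval_hours=1.0,
                          checkpoint_write_hours=0.05,
                          restart_hours=0.5)
    print(f"Projected job MTTF: {mttf_job:.1f} h, estimated ETTR: {ettr:.3f}")
```

Under these assumptions, larger jobs see shorter MTTFs and therefore lower ETTRs, which is why the checkpoint interval becomes a tuning knob at scale: write too rarely and failures lose more work; write too often and the steady-state checkpoint cost dominates.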