Failures in clusters running large-scale AI workloads can reduce utilization. Because a single failure in such a workload is costly (the entire job must be restarted from a previous checkpoint), many mechanisms exist to mitigate failures and minimize their impact. However, these mechanisms expose many knobs and parameters, all of which must be carefully tuned to the characteristics of the system and the cluster. We built AIReSim, a discrete-event simulator, to evaluate design choices in the failure, recovery, scheduling, and repair processes of a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of these knobs and parameters on the end-to-end reliability of the system. Further, AIReSim can identify which knobs or parameters matter most, helping to prioritize where effort to improve the system is invested. AIReSim also allows the knobs to be tuned to achieve different tradeoffs in the system and to explore various ``what-if'' scenarios. We present a case study applying AIReSim to capacity planning for large-scale clusters running AI workloads.