Failures in clusters running large-scale AI workloads can reduce utilization. Because a failure in such a workload is costly (the entire job must be restarted from a previous checkpoint), many mechanisms are in place to mitigate failures and minimize their impact. However, these mechanisms expose many knobs and parameters, all of which must be carefully tuned to the characteristics of the system and the cluster. We built AIReSim, a discrete-event simulator, to evaluate design choices in the failure, recovery, scheduling, and repair processes of a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of these knobs and parameters on the end-to-end reliability of the system. Further, AIReSim can be used to identify which knobs or parameters matter most, so that effort to improve the system can be prioritized accordingly. AIReSim also allows the knobs to be tuned to achieve different tradeoffs in the system and supports exploring various ``what-if'' scenarios. We present a case study that applies AIReSim to capacity planning for large-scale clusters running AI workloads.