AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

Failures in clusters running large-scale AI workloads can reduce utilization. Because a single failure in such a workload is costly (the entire job must restart from a previous checkpoint), many mechanisms exist to mitigate failures and minimize their impact. However, these mechanisms expose many knobs and parameters, all of which must be carefully tuned to the characteristics of the system and cluster. We built AIReSim, a discrete event simulator, to evaluate design choices in the failure, recovery, scheduling, and repair processes of a cluster running a large-scale AI workload. AIReSim lets the system designer systematically evaluate how the different knobs and parameters affect the overall end-to-end reliability of the system. Further, AIReSim can identify which knobs or parameters matter most, so that effort to improve the system can be prioritized. AIReSim also supports tuning the knobs to achieve different tradeoffs in the system and exploring various "what-if" scenarios. We present a case study that applies AIReSim to capacity planning for large-scale clusters running AI workloads.
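The abstract does not describe AIReSim's internal model, so the following is only a minimal sketch of the general technique it names: a discrete event simulation of a job that restarts from its last checkpoint whenever any server fails. Everything here is an assumption made for illustration, including the function name `simulate`, its parameters, exponentially distributed failures, zero-cost checkpoints, and a fixed recovery delay; none of it is taken from AIReSim itself.

```python
import heapq
import random


def simulate(work_hours, n_servers, server_mtbf_hours,
             checkpoint_interval_hours, recovery_hours, seed=0):
    """Return (wall_clock_hours, failure_count) for one simulated job.

    Hypothetical model: the job stops if any of n_servers fails, loses
    all progress since its last checkpoint, waits recovery_hours, then
    resumes. Checkpoints are assumed to be free.
    """
    rng = random.Random(seed)
    failure_rate = n_servers / server_mtbf_hours  # failures per hour, whole job
    events = []       # min-heap of (time, kind, epoch)
    saved = 0.0       # progress preserved by the last checkpoint
    failures = 0
    epoch = 0         # incremented on restart to invalidate stale events
    run_start = 0.0

    def schedule_run(start, remaining, run_epoch):
        # Each run has two competing events: the next failure and completion.
        next_failure = start + rng.expovariate(failure_rate)
        heapq.heappush(events, (next_failure, "failure", run_epoch))
        heapq.heappush(events, (start + remaining, "finish", run_epoch))

    schedule_run(run_start, work_hours - saved, epoch)
    while events:
        now, kind, event_epoch = heapq.heappop(events)
        if event_epoch != epoch:
            continue  # event belongs to a run that already failed
        if kind == "finish":
            return now, failures
        # Failure: keep only the progress covered by completed checkpoints.
        failures += 1
        progress = now - run_start
        completed = progress // checkpoint_interval_hours
        saved += completed * checkpoint_interval_hours
        epoch += 1
        run_start = now + recovery_hours
        schedule_run(run_start, work_hours - saved, epoch)


if __name__ == "__main__":
    # A "what-if" sweep over one knob: the checkpoint interval.
    for interval in (0.5, 1.0, 4.0):
        total, count = simulate(100.0, 1000, 50_000.0, interval, 0.5, seed=1)
        print(f"interval={interval}h total={total:.1f}h failures={count}")
```

Sweeping a single parameter while holding the others fixed, as in the loop at the bottom, is the simplest form of the "what-if" analysis the abstract mentions. A fuller model would also need spare capacity, repair queues, and scheduling, which this sketch omits.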

