As software systems grow more complex, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as a lack of cross-platform generality and cross-task flexibility. Recent advances in large language models (LLMs) can substantially address these challenges, and many approaches have already been proposed to explore this direction. Yet there is currently no comprehensive survey that discusses the differences between LLM-based AIOps and traditional AIOps methods. This paper therefore presents a comprehensive survey of AIOps technology for failure management in the LLM era. It covers a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. The survey also examines the AIOps subtasks, the specific LLM-based approaches suited to each subtask, and the challenges and future directions of the domain, with the aim of furthering its development and application.