As software systems grow increasingly intricate, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as a lack of cross-platform generality and cross-task flexibility. Recent advances in large language models (LLMs) can substantially mitigate these challenges, and many approaches have already been proposed in this direction. Yet there is currently no comprehensive survey that discusses the differences between LLM-based AIOps and traditional AIOps methods. This paper therefore presents a comprehensive survey of AIOps technology for failure management in the LLM era. It covers a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. The survey also explores the AIOps subtasks, the specific LLM-based approaches suited to each subtask, and the challenges and future directions of the domain, with the aim of furthering its development and application.