AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.