The rapid growth in the use of Large Language Models (LLMs) and AI agents as part of software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge. There is growing interest in AI for IT Operations (AIOps), which aims to automate complex operational tasks, like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork to build a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.