The rapid growth in the use of Large Language Models (LLMs) and AI agents in software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge. Interest is growing in AI for IT Operations (AIOps), which aims to automate complex operational tasks such as fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork for a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.