Unresolved production cloud incidents cost an average of over $2M per hour. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing cloud incidents caused by code and configuration. PRAXIS employs an LLM-driven structured traversal over two types of graphs: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies within each microservice. Compared to state-of-the-art ReAct baselines, PRAXIS improves root cause analysis (RCA) accuracy by up to 6.3x while reducing token consumption by 5.3x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents, which is being compiled into an RCA benchmark.
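To make the idea of a structured traversal over a service dependency graph concrete, the following is a minimal sketch and not the PRAXIS implementation. It assumes the SDG is a plain adjacency map from each service to the services it calls, and it replaces the LLM's judgment with a hypothetical `is_suspicious` predicate. The service names, the `traverse_sdg` function, and the health data are all illustrative assumptions. The code-level PDG stage is omitted.

```python
from collections import deque
from typing import Callable, Dict, List, Set

# Hypothetical service dependency graph: service -> services it calls.
SDG: Dict[str, List[str]] = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def traverse_sdg(
    sdg: Dict[str, List[str]],
    start: str,
    is_suspicious: Callable[[str], bool],
) -> List[str]:
    """Walk breadth-first from the alerting service, descending only into
    dependencies flagged as suspicious.

    Returns the visited services that have no suspicious dependencies,
    i.e. the deepest points the symptoms can be traced to.
    """
    visited: Set[str] = {start}
    queue = deque([start])
    candidates: List[str] = []

    while queue:
        service = queue.popleft()
        # In an agentic workflow this check would be an LLM call that
        # inspects logs, metrics, or traces; here it is a stub predicate.
        suspicious_deps = [d for d in sdg.get(service, []) if is_suspicious(d)]
        if not suspicious_deps:
            candidates.append(service)
        for dep in suspicious_deps:
            if dep not in visited:
                visited.add(dep)
                queue.append(dep)

    return candidates


# Assumed observation: these services currently show anomalies.
unhealthy = {"frontend", "checkout", "inventory"}
result = traverse_sdg(SDG, "frontend", lambda s: s in unhealthy)
print(result)
```

Restricting the walk to flagged dependencies is what keeps the search space small in this sketch: healthy branches such as `catalog` and `payments` are never expanded, which is one plausible reason a graph-guided traversal could consume fewer tokens than an unconstrained agent loop.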