The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To support root cause analysis (RCA) and resolution of alert events, we propose multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), a framework for the AI for IT operations (AIOps) domain. In mABC, multiple agents based on large language models (LLMs) follow a standardized process for handling tasks and queries, provided by Agent Workflow, and perform blockchain-inspired voting to reach a final agreement. Specifically, seven specialized agents derived from Agent Workflow collaborate within a decentralized chain, each contributing insights towards root cause analysis based on its own expertise and the intrinsic software knowledge of LLMs. To mitigate potential instability in LLMs and to leverage the transparency and egalitarianism of a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles that takes into account the contribution index and expertise index of each agent. Experimental results on the public AIOps challenge benchmark dataset and on a train-ticket dataset that we created show that mABC outperforms previous strong baselines in accurately identifying root causes and formulating effective solutions. The ablation study further highlights the significance of each component within mABC: Agent Workflow, the multi-agent design, and blockchain-inspired voting are all crucial for achieving optimal performance. mABC thus offers comprehensive automated root cause analysis and resolution for micro-services architecture and achieves a significant improvement over existing baselines in the AIOps domain.
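To make the voting idea concrete, the following is a minimal sketch of how a vote weighted by each agent's contribution index and expertise index might be tallied. The abstract states only that both indices are considered. The weighting formula (their product), the support and participation thresholds, the three vote options, and the agent names are all illustrative assumptions, not the method described in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Vote(Enum):
    FOR = "for"
    AGAINST = "against"
    ABSTAIN = "abstain"


@dataclass
class Agent:
    name: str                  # hypothetical agent name
    contribution_index: float  # assumed: higher means more active
    expertise_index: float     # assumed: higher means more reliable

    @property
    def weight(self) -> float:
        # Assumption: voting weight is the product of the two indices.
        return self.contribution_index * self.expertise_index


def tally(agents, votes, support_threshold=0.5, participation_threshold=0.5):
    """Return (passed, support, participation) for one proposal.

    Assumed rule: the proposal passes when the weighted share of agents
    who cast a non-abstaining vote and the weighted share voting FOR
    both reach their thresholds. Missing votes count as abstentions.
    """
    total = sum(a.weight for a in agents)
    if total <= 0:
        raise ValueError("total voting weight must be positive")

    cast = sum(
        a.weight for a in agents
        if votes.get(a.name, Vote.ABSTAIN) is not Vote.ABSTAIN
    )
    in_favor = sum(
        a.weight for a in agents
        if votes.get(a.name, Vote.ABSTAIN) is Vote.FOR
    )

    participation = cast / total
    support = in_favor / total
    passed = (participation >= participation_threshold
              and support >= support_threshold)
    return passed, support, participation


if __name__ == "__main__":
    agents = [
        Agent("agent_a", 1.0, 1.0),
        Agent("agent_b", 1.0, 2.0),
        Agent("agent_c", 1.0, 1.0),
    ]
    votes = {"agent_a": Vote.FOR, "agent_b": Vote.FOR, "agent_c": Vote.AGAINST}
    print(tally(agents, votes))
```

Requiring a participation threshold in addition to a support threshold means that, in this sketch, a proposal cannot pass on the strength of a single agent while the others abstain.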