The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To support root cause analysis (RCA) and resolution of alert events, we propose mABC (multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture), a framework for the AI for IT operations (AIOps) domain in which multiple agents based on large language models (LLMs) follow a standardized process for handling tasks and queries, provided by Agent Workflow, and reach a final agreement through blockchain-inspired voting. Specifically, seven specialized agents derived from Agent Workflow collaborate within a decentralized chain, each contributing insights to root cause analysis based on its expertise and the intrinsic software knowledge of LLMs. To mitigate the potential instability of LLMs and to fully leverage the transparency and egalitarianism inherent in a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles that takes into account the contribution index and expertise index of each agent. Experimental results on the public AIOps challenge benchmark dataset and on the train-ticket dataset we created show that mABC identifies root causes and formulates effective solutions more accurately than previous strong baselines. An ablation study further highlights the significance of each component within mABC, with Agent Workflow, the multi-agent design, and blockchain-inspired voting all being crucial for achieving optimal performance. mABC thus offers comprehensive automated root cause analysis and resolution in micro-services architecture and achieves a significant improvement over existing baselines in the AIOps domain.
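The voting step described above can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so everything concrete here is an assumption made for illustration: the names `Agent`, `Vote`, and `tally`, the choice to weight each ballot by the product of the contribution index and expertise index, and the two thresholds (`support_threshold`, `participation_threshold`). It should not be read as mABC's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Vote(Enum):
    FOR = "for"
    AGAINST = "against"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class Agent:
    """A voting agent. All fields are hypothetical, for illustration only."""
    name: str
    contribution_index: float
    expertise_index: float

    @property
    def weight(self) -> float:
        # Assumption: a ballot's weight is the product of the two indices.
        return self.contribution_index * self.expertise_index


def tally(ballots, support_threshold=0.5, participation_threshold=0.5):
    """Return True if a proposal passes a weighted vote.

    `ballots` is a list of (Agent, Vote) pairs. Both rates are measured
    against the total weight of all agents (an assumption of this sketch).
    """
    total = sum(agent.weight for agent, _ in ballots)
    if total == 0:
        return False  # nobody carries any weight, so nothing can pass

    # Weight of agents who did not abstain.
    cast = sum(agent.weight for agent, vote in ballots if vote is not Vote.ABSTAIN)
    # Weight of agents who voted in favour.
    support = sum(agent.weight for agent, vote in ballots if vote is Vote.FOR)

    participation_rate = cast / total
    support_rate = support / total
    return (participation_rate >= participation_threshold
            and support_rate >= support_threshold)


if __name__ == "__main__":
    a = Agent("agent_a", contribution_index=1.0, expertise_index=1.0)
    b = Agent("agent_b", contribution_index=1.0, expertise_index=2.0)
    c = Agent("agent_c", contribution_index=1.0, expertise_index=1.0)
    print(tally([(a, Vote.FOR), (b, Vote.FOR), (c, Vote.AGAINST)]))
```

In this sketch, an agent with a higher expertise index moves the outcome more than its peers, and a proposal fails when too much weight abstains, which is one way a vote can guard against a single unstable LLM answer being accepted unchecked.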