The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored to the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/.