Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean recovery time (MTTR). This paper presents an agentic observability framework deployed within Adobe's e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual triage, while maintaining comparable diagnostic accuracy. Our results show that agentic AI enables an order-of-magnitude reduction in triage latency and a step-change in resolution accuracy, marking a pivotal shift toward autonomous observability in enterprise operations.
翻译:现代企业系统呈现出复杂的相互依赖性,使得可观测性与事件响应日益具有挑战性。手动告警分诊通常涉及日志检查、API验证和操作知识库交叉引用,这仍然是降低平均恢复时间(MTTR)的主要瓶颈。本文提出了一种部署于Adobe电子商务基础设施中的智能可观测性框架,该框架利用ReAct范式自主执行告警分诊。在检测到告警后,智能体动态识别受影响的服务,检索并分析分布式系统中的关联日志,并规划上下文相关的操作,例如手册查阅、运行手册执行或对近期部署代码进行检索增强分析。生产环境部署的实证结果表明,与手动分诊相比,平均洞察时间减少了90%,同时保持了相当的诊断准确性。我们的研究结果表明,智能体人工智能能够将分诊延迟降低一个数量级,并在解决准确性上实现阶跃式提升,这标志着企业运营向自主可观测性迈出了关键一步。