When a networked system fails, automated Root Cause Analysis (RCA) over observability data is critical to maintaining reliability. Recently, LLM-based agents have shown promise for automating this diagnosis through advanced reasoning and autonomous exploration. However, existing observability frameworks remain outdated: fragmented data silos, incompatible schemas, and insufficient semantic metadata prevent agents from establishing the complex relationships that effective RCA requires. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer in which heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. Re-modeling the "AIOps 2025 Challenge" dataset with UModel improved the precision of root cause localization by 8%, demonstrating that better data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework: deployed at Alibaba Cloud for more than one year, it has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.
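To make the object-centric idea concrete, the following is a minimal sketch, not UModel's actual schema or U-SPL syntax. It assumes a toy graph in which entities and telemetry sets are typed nodes joined by named relations; all identifiers (`Node`, `ObjectGraph`, the relation names `calls`, `runs_on`, `observed_by`, and the sample services) are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A typed object in the graph (hypothetical kinds: entity, metric_set, log_set)."""
    id: str
    kind: str


@dataclass
class ObjectGraph:
    """Toy semantic graph: nodes keyed by id, edges as (relation, target id) pairs."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add(self, node):
        self.nodes[node.id] = node
        self.edges.setdefault(node.id, [])

    def link(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def reachable(self, start, relations, max_hops=3):
        """Breadth-first walk from start, following only the given relation types."""
        seen, order = {start}, []
        queue = deque([(start, 0)])
        while queue:
            current, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for relation, target in self.edges[current]:
                if relation in relations and target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append((target, hops + 1))
        return order

    def telemetry_for(self, entity_id):
        """Telemetry sets attached to an entity via the assumed 'observed_by' relation."""
        return [dst for rel, dst in self.edges[entity_id] if rel == "observed_by"]


graph = ObjectGraph()
for node in [
    Node("svc:checkout", "entity"),
    Node("svc:payment", "entity"),
    Node("host:node-7", "entity"),
    Node("metrics:payment.latency", "metric_set"),
    Node("logs:node-7.syslog", "log_set"),
]:
    graph.add(node)

graph.link("svc:checkout", "calls", "svc:payment")
graph.link("svc:payment", "runs_on", "host:node-7")
graph.link("svc:payment", "observed_by", "metrics:payment.latency")
graph.link("host:node-7", "observed_by", "logs:node-7.syslog")

# Starting from a failing service, collect downstream entities and their telemetry.
candidates = graph.reachable("svc:checkout", {"calls", "runs_on"})
evidence = {entity: graph.telemetry_for(entity) for entity in candidates}
```

In this sketch, a single traversal from the failing service yields both the candidate root-cause entities and the telemetry attached to each, which is the kind of cross-silo correlation that separate metric and log stores with incompatible schemas make difficult.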