Anomaly Detection and Root Cause Analysis for Microservice Systems

Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.
