Effectively localizing the root causes of performance anomalies is crucial for rapid recovery and loss mitigation in cloud-based microservice applications. The actions available to a service operator depend on the granularity at which causes can be localized: if only the faulty service can be identified (coarse-grained), the operator can restart or migrate it; if specific indicative metrics on the faulty service can be identified (fine-grained), the operator can scale the relevant resources. Prior research has mainly focused on coarse-grained faulty service localization, but interest is growing in fine-grained root cause localization, which identifies both the faulty service and its metrics. Causal inference (CI) based methods have recently gained popularity for root cause localization, but the CI methods currently in use have limitations, such as the assumption of linear causal relations and strict requirements on data distribution. To tackle these challenges, we propose CausalRCA, a framework for fine-grained, automated, and real-time root cause localization. CausalRCA uses a gradient-based causal structure learning method to generate weighted causal graphs and a root cause inference method to localize root cause metrics. We evaluate its localization performance on both coarse- and fine-grained root cause localization. Experimental results show that CausalRCA significantly outperforms baseline methods in localization accuracy; e.g., the average AC@3 of fine-grained root cause metric localization in the faulty service is 0.719, an average increase of 10% over baseline methods. In addition, the average Avg@5 improves by 9.43%.
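The abstract reports AC@3 and Avg@5 without defining them. The sketch below shows one way to compute such scores, assuming the definitions commonly used in the root cause localization literature: AC@k is the fraction of true root causes found in the top k ranked candidates, normalized by min(k, number of true root causes) and averaged over anomaly cases, and Avg@k is the mean of AC@1 through AC@k. The function names, metric names, and example rankings are hypothetical and are not taken from CausalRCA.

```python
from typing import Sequence, Set


def ac_at_k(rankings: Sequence[Sequence[str]],
            root_causes: Sequence[Set[str]],
            k: int) -> float:
    """AC@k averaged over anomaly cases (assumed definition, see lead-in)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not rankings or len(rankings) != len(root_causes):
        raise ValueError("need one non-empty ranking per anomaly case")
    total = 0.0
    for ranked, truth in zip(rankings, root_causes):
        if not truth:
            raise ValueError("each case needs at least one true root cause")
        # Count true root causes that appear among the top-k candidates.
        hits = sum(1 for candidate in ranked[:k] if candidate in truth)
        total += hits / min(k, len(truth))
    return total / len(rankings)


def avg_at_k(rankings: Sequence[Sequence[str]],
             root_causes: Sequence[Set[str]],
             k: int) -> float:
    """Avg@k: mean of AC@1 .. AC@k (assumed definition, see lead-in)."""
    return sum(ac_at_k(rankings, root_causes, j) for j in range(1, k + 1)) / k


# Hypothetical example: three anomaly cases, each with a ranked list of
# candidate metrics and a single labeled root cause metric.
example_rankings = [
    ["cpu", "mem", "net"],
    ["mem", "net", "cpu"],
    ["net", "mem", "cpu"],
]
example_truth = [{"cpu"}, {"cpu"}, {"mem"}]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"AC@{k} = {ac_at_k(example_rankings, example_truth, k):.3f}")
    print(f"Avg@3 = {avg_at_k(example_rankings, example_truth, 3):.3f}")
```

Under these assumed definitions, an AC@3 of 0.719 would mean that, averaged over the evaluated anomaly cases, the true root cause metric appears among the top three ranked candidates roughly 72% of the time.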