The complexity and dynamism of microservices pose significant challenges to system reliability, making automated troubleshooting crucial. In particular, effective root cause localization after anomaly detection is essential for keeping microservice systems reliable. However, existing approaches have two significant limitations: (1) Microservices generate traces, system logs, and key performance indicators (KPIs), but existing approaches usually consider traces only, and thus fail to understand the system fully because traces cannot depict all anomalies; (2) Troubleshooting microservices generally involves two main phases, i.e., anomaly detection and root cause localization. Existing studies treat these two phases as independent and ignore their close correlation. Even worse, inaccurate detection results can severely degrade localization effectiveness. To overcome these limitations, we propose Eadro, the first end-to-end framework that integrates anomaly detection and root cause localization based on multi-source data for troubleshooting large-scale microservices. The key insights behind Eadro are that anomalies manifest differently across data sources and that detection and localization are closely connected. Accordingly, Eadro models intra-service behaviors and inter-service dependencies from traces, logs, and KPIs, while leveraging the knowledge shared between the two phases via multi-task learning. Experiments on two widely used benchmark microservices demonstrate that Eadro outperforms state-of-the-art approaches by a large margin. The results also show the usefulness of integrating multi-source data. We also release our code and data to facilitate future research.
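To make the multi-task idea concrete, the following is a minimal sketch of how a detection objective and a localization objective could be combined into one training loss. The abstract does not specify the loss, so everything here is an assumption for illustration only: the function name `joint_loss`, the weighting parameter `beta`, and the choice of binary cross-entropy for detection and softmax cross-entropy over services for localization are hypothetical and are not taken from Eadro's released code.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(detect_logit, service_logits, is_anomalous, culprit_index, beta=0.5):
    """Hypothetical joint objective: beta * detection + (1 - beta) * localization.

    detect_logit   -- scalar logit for "this window is anomalous"
    service_logits -- one logit per microservice (root cause candidates)
    is_anomalous   -- ground-truth detection label
    culprit_index  -- index of the faulty service (ignored for normal windows)
    beta           -- assumed trade-off weight between the two tasks
    """
    eps = 1e-12  # guards against log(0)
    p = sigmoid(detect_logit)
    y = 1.0 if is_anomalous else 0.0
    # Binary cross-entropy for the detection task.
    detect_loss = -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))

    # Localization is only meaningful when an anomaly is actually present.
    if is_anomalous:
        probs = softmax(service_logits)
        locate_loss = -math.log(probs[culprit_index] + eps)
    else:
        locate_loss = 0.0

    return beta * detect_loss + (1.0 - beta) * locate_loss
```

Because both terms are computed from one shared model output, gradients from either task would update the same underlying representation, which is the general mechanism by which multi-task learning shares knowledge between phases. In this sketch, a call such as `joint_loss(2.0, [3.0, 0.1, -1.0], True, 0)` yields a smaller loss than the same call with `culprit_index=2`, since the first service receives the highest score.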