Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. Although the term is widely used, studies formulate the task differently, because "RCA" implicitly covers tasks with distinct underlying goals. For instance, localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions and conventionally categorize papers by input data type (e.g., metric-based vs. trace-based methods). This groups together works with disparate objectives, obscuring the true progress and gaps in the field. Meanwhile, the typical readers of an RCA survey are either newcomers who want to understand the goals and big picture of the task, or RCA researchers who want to find prior work under the same task formulation. An RCA survey that organizes papers by their goals is therefore needed. To this end, this paper presents a goal-driven framework that categorizes and integrates 135 papers on RCA for cloud incident management, published from 2014 to 2025, according to their diverse goals. Beyond the goal-driven categorization, it discusses the ultimate goal of all RCA papers as an umbrella covering the different RCA formulations. Finally, the paper discusses open challenges and future directions in RCA.