RCInvestigator: Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems

Quickly finding the root causes of anomalies in cloud computing systems is crucial to ensuring availability and efficiency, since accurate root causes guide engineers toward appropriate actions that resolve the anomalies and maintain customer satisfaction. However, it is difficult to investigate and identify root causes from the large-scale, high-dimensional monitoring data collected from complex cloud computing environments. Because cloud computing systems are inherently dynamic, existing approaches in practice rely largely on manual analysis for flexibility and reliability, but the many unpredictable factors and the high data complexity make the process time-consuming. Despite recent advances in automated detection and investigation approaches, the speed and quality of root cause analysis remain limited by the lack of expert involvement in these approaches. These limitations motivate us to propose a visual analytics approach that facilitates interactive investigation of anomaly root causes in cloud computing systems. We identified three challenges: a) modeling databases for root cause investigation, b) inferring root causes from large-scale time series, and c) building comprehensible investigation results. In collaboration with domain experts, we addressed these challenges with RCInvestigator, a novel visual analytics system that establishes tight collaboration between humans and machines and assists experts in investigating the root causes of cloud computing system anomalies. We evaluated the effectiveness of RCInvestigator through two use cases based on real-world data and received positive feedback from experts.
