Debugging performance anomalies in real-world databases is challenging. Causal inference techniques enable qualitative and quantitative root cause analysis of performance degradation. Nevertheless, causality analysis is difficult in practice, particularly because of limited observability. Recently, chaos engineering has been applied to test complex real-world software systems. Chaos frameworks like Chaos Mesh mutate a set of chaos variables to inject catastrophic events (e.g., network slowdowns) that "stress" software systems. The systems under chaos stress are then tested using methods like differential testing to check whether they retain their normal functionality (e.g., SQL query output is always correct under stress). Despite its ubiquity in industry, chaos engineering is currently employed mostly to aid software testing rather than performance debugging. This paper identifies a novel use of chaos engineering: helping developers diagnose performance anomalies in databases. Our framework, PERFCE, comprises an offline phase and an online phase. The offline phase learns statistical models of the target database system, whilst the online phase diagnoses the root cause of monitored performance anomalies on the fly. During the offline phase, PERFCE leverages both passive observations and proactive chaos experiments to construct accurate causal graphs and structural equation models (SEMs). When performance anomalies are observed during the online phase, causal graphs enable qualitative root cause identification (e.g., high CPU usage) and SEMs enable quantitative counterfactual analysis (e.g., determining that "when CPU usage is reduced to 45%, performance returns to normal"). PERFCE notably outperforms prior works on common synthetic datasets, and our evaluation on real-world databases, MySQL and TiDB, shows that PERFCE is highly accurate at moderate cost.
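To make the counterfactual query concrete, the following is a minimal sketch of how an SEM can answer "what would latency have been if CPU usage were 45%?". It is not PERFCE's implementation: it assumes a single causal edge (CPU usage to latency), a linear SEM with additive noise fitted by least squares, and entirely hypothetical data, thresholds, and function names.

```python
from statistics import mean


def fit_linear_sem(xs, ys):
    """Least-squares fit of y = slope * x + intercept (a one-edge linear SEM)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def counterfactual(slope, intercept, x_obs, y_obs, x_new):
    """Abduction, action, prediction for the additive-noise model above."""
    # Abduction: recover the noise term that explains the observed anomaly.
    noise = y_obs - (slope * x_obs + intercept)
    # Action + prediction: set the cause to x_new and keep the same noise.
    return slope * x_new + intercept + noise


# Hypothetical offline data: CPU usage (%) and query latency (ms).
cpu = [20.0, 30.0, 40.0, 50.0, 60.0]
latency = [14.0, 16.0, 18.0, 20.0, 22.0]
slope, intercept = fit_linear_sem(cpu, latency)

# Hypothetical online anomaly: 90% CPU usage with 30 ms latency.
cf_latency = counterfactual(slope, intercept, 90.0, 30.0, 45.0)

NORMAL_LATENCY_MS = 25.0  # assumed threshold for "normal" performance
print(f"counterfactual latency at 45% CPU: {cf_latency:.1f} ms")
print("back to normal:", cf_latency <= NORMAL_LATENCY_MS)
```

A real system would involve a multi-variable causal graph and possibly non-linear structural equations; the single linear edge here is chosen only to keep the three counterfactual steps visible.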