Runtime failures and performance degradation are commonplace in modern cloud systems. For cloud providers, automatically determining the root cause of incidents is paramount to ensuring high reliability and availability, as prompt fault localization enables faster diagnosis and triage for timely resolution. A compelling solution explored in recent work is causal reasoning, which uses causal graphs to capture relationships between varied cloud system performance metrics. To be effective, however, system developers must correctly define the causal graph of their system, a time-consuming, brittle, and challenging task that requires domain expertise and grows harder for large and dynamic systems. Automated data-driven approaches, the alternative, have limited efficacy for cloud systems due to the inherent rarity of incidents. In this work, we present Atlas, a novel approach to automatically synthesizing causal graphs for cloud systems. Atlas leverages large language models (LLMs) to generate causal graphs using system documentation, telemetry, and deployment feedback. Atlas is complementary to data-driven causal discovery techniques, and we further enhance Atlas with a data-driven validation step. We evaluate Atlas across a range of fault localization scenarios and demonstrate that Atlas is capable of generating causal graphs in a scalable and generalizable manner, with performance that far surpasses that of data-driven algorithms and is commensurate with the ground-truth baseline.