We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform based on a Domain Specific Language (DSL) built for fast diagnostic implementation and a shallow learning curve. ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that execute in parallel to detect issues using product telemetry and apply mitigations in near-real-time. The DSL is tailored so that subject matter experts can deliver highly curated and relevant Auto-TSGs in a short time without having to understand how they will interact with the rest of the diagnostic platform, thus reducing time-to-mitigate and saving crucial engineering cycles when they matter most. This contrasts with platforms like Datadog and New Relic, which primarily focus on monitoring and require manual intervention for mitigation. ARCAS uses a Large Language Model (LLM) to prioritize Auto-TSG outputs and take appropriate actions, which removes the costly requirement of understanding the general behavior of the system. We explain the key concepts behind ARCAS and demonstrate how it has been successfully used for multiple products across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
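To make the architecture described above concrete, the following is a minimal sketch of independent diagnostic checks running in parallel over shared telemetry, with their findings ranked afterwards. It is written in plain Python, not in the ARCAS DSL, which is not shown here. Every name in it (`Finding`, `run_auto_tsgs`, the three example checks, the telemetry keys, and the thresholds) is hypothetical, and a simple severity sort stands in for the LLM-based prioritization.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    tsg: str         # name of the check that produced the finding
    issue: str       # short description of the detected issue
    severity: int    # hypothetical scale: 1 (low) to 5 (high)
    mitigation: str  # suggested mitigation action


Telemetry = Dict[str, float]
AutoTsg = Callable[[Telemetry], Optional[Finding]]


# Each check is self-contained: it reads telemetry and knows nothing
# about the other checks or about how results are combined.
def high_queue_wait(t: Telemetry) -> Optional[Finding]:
    if t.get("queue_wait_s", 0.0) > 30.0:  # assumed threshold
        return Finding("high_queue_wait", "queries queued too long",
                       3, "scale out compute")
    return None


def memory_pressure(t: Telemetry) -> Optional[Finding]:
    if t.get("memory_used_pct", 0.0) > 90.0:  # assumed threshold
        return Finding("memory_pressure", "node near memory limit",
                       5, "restart node")
    return None


def disk_latency(t: Telemetry) -> Optional[Finding]:
    if t.get("disk_latency_ms", 0.0) > 50.0:  # assumed threshold
        return Finding("disk_latency", "slow storage reads",
                       2, "move to another volume")
    return None


def run_auto_tsgs(tsgs: List[AutoTsg], telemetry: Telemetry) -> List[Finding]:
    """Run all checks in parallel and return findings, most severe first."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda tsg: tsg(telemetry), tsgs))
    findings = [r for r in results if r is not None]
    # Stand-in for prioritization: ARCAS uses an LLM here; this sketch
    # only sorts by the hypothetical severity field.
    return sorted(findings, key=lambda f: f.severity, reverse=True)


if __name__ == "__main__":
    sample = {"queue_wait_s": 45.0, "memory_used_pct": 95.0, "disk_latency_ms": 2.0}
    for finding in run_auto_tsgs([high_queue_wait, memory_pressure, disk_latency], sample):
        print(finding.tsg, "->", finding.mitigation)
```

The point of the sketch is the separation of concerns: a check author writes one small function against telemetry, while the runner handles parallel execution and ranking.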