AI-based monitoring has become crucial for cloud-based services because of their scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. The availability of domain information makes cloud systems especially well suited to such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate domain information about the system architecture nor provide a way to model compute distributed across a varying number of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from the system architecture. For application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows the efficacy of CausIL over baselines, improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance, while a real-world dataset demonstrates CausIL's applicability in deployment settings.
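For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a Structural Hamming Distance between a ground-truth causal graph and an estimated one could be computed. It is an illustration only, not CausIL's evaluation code. The function name, the edge-set representation, and the metric names `load`, `cpu`, and `latency` are hypothetical. The sketch also assumes that a reversed edge counts as a single error; some implementations count a reversal as two.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count node pairs whose edge status differs between two directed graphs.

    Each graph is an iterable of (cause, effect) tuples.
    Assumption: a missing edge, an extra edge, and a reversed edge
    each count as one error.
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    # Directed edges that appear in exactly one of the two graphs.
    differing = true_edges ^ est_edges
    # Collapse each differing edge to its unordered node pair, so a
    # reversal (u -> v versus v -> u) is counted once rather than twice.
    mismatched_pairs = {frozenset(edge) for edge in differing}
    return len(mismatched_pairs)


# Hypothetical ground truth: load -> cpu -> latency.
true_graph = {("load", "cpu"), ("cpu", "latency")}
# Hypothetical estimate: one reversed edge and one extra edge.
est_graph = {("load", "cpu"), ("latency", "cpu"), ("load", "latency")}

shd = structural_hamming_distance(true_graph, est_graph)
print(shd)  # 2: one reversal (cpu/latency) plus one extra edge (load/latency)
```

A lower value means the estimated graph is structurally closer to the ground truth, and a value of 0 means the two graphs are identical.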