AI-based monitoring has become crucial for cloud-based services because of their scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. The availability of domain information makes cloud systems particularly well suited to such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate domain information about the system architecture nor provide a way to model compute distributed across a varying number of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from the system architecture. For application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows that CausIL improves graph estimation accuracy over baselines by ~25%, as measured by Structural Hamming Distance, while an evaluation on a real-world dataset demonstrates CausIL's applicability in deployment settings.
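To make the evaluation metric concrete, the following is a minimal sketch of how a Structural Hamming Distance between a ground-truth and an estimated causal graph might be computed. It is illustrative only and is not taken from the CausIL implementation. It assumes graphs are given as sets of directed `(cause, effect)` edges and that a reversed edge counts as a single error (some definitions count it as two). The function name and the metric names in the example are hypothetical.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count the node pairs whose edge status differs between two directed graphs.

    Each graph is an iterable of (cause, effect) tuples. A missing edge, an
    extra edge, and a reversed edge each contribute 1 (assumed convention).
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    # Directed edges that appear in exactly one of the two graphs.
    mismatched = true_edges ^ est_edges
    # Collapse each mismatch to its unordered node pair, so that a reversal
    # (a -> b in one graph, b -> a in the other) is counted once, not twice.
    return len({frozenset(edge) for edge in mismatched})


# Hypothetical service metrics, used purely for illustration.
truth = {("load", "cpu"), ("cpu", "latency")}
estimate = {("cpu", "load"), ("cpu", "latency"), ("load", "latency")}

# One reversed edge (load/cpu) plus one extra edge (load -> latency).
print(structural_hamming_distance(truth, estimate))
```

A lower value means the estimated graph is structurally closer to the ground truth, so an improvement in accuracy under this metric corresponds to a reduction in the distance.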