Metric Criticality Identification for Cloud Microservices

For Site Reliability Engineers (SREs), alerts are typically the first, and often the primary, indication that a system may not be performing as expected. Once alerts are triggered, SREs delve into detailed data across modalities such as metrics, logs, and traces to diagnose system issues. However, defining an optimal set of alerts is increasingly challenging due to the sheer volume of multi-modal observability data in large cloud-native systems. Typically, alerts are manually curated, defined primarily on the metrics modality, and heavily reliant on subject matter experts who must navigate the large state space of intricate relationships in multi-modal observability data. This process leaves alert definitions prone to insufficient coverage, so critical events may be missed. Defining alerts is even more challenging with the shift from traditional monolithic architectures to microservice-based architectures, because microservices interact in intricate ways governed by the application topology in a stochastic environment. To tackle this issue, we take a data-driven approach and propose KIMetrix, a system that relies only on historical metric data and lightweight microservice traces to identify microservice metric criticality. KIMetrix significantly aids subject matter experts by identifying a critical set of metrics on which to define alerts, removing the need to sift through the vast multi-modal observability space. KIMetrix exploits the coupling between metrics and traces and leverages information-theoretic measures to recommend microservice-metric mappings in a microservice topology-aware manner. Experimental evaluation on state-of-the-art microservice-based applications demonstrates the effectiveness of our approach.
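To make the idea of an information-theoretic criticality score more concrete, the following is a minimal sketch and not KIMetrix's actual algorithm. The abstract does not say which measure is used or what the metrics are scored against, so this sketch assumes empirical mutual information between each discretized metric series and a binary series marking degraded intervals. All names (`discretize`, `mutual_information`, `rank_metrics`, the metric names, and the `degraded` labels) are hypothetical, and the topology and trace information that KIMetrix relies on is deliberately left out.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Equal-width binning of a numeric series into integer bin labels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; put everything in one bin.
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_metrics(metrics, target, bins=4):
    """Score each metric against the target and sort from most to least informative."""
    scores = {
        name: mutual_information(discretize(series, bins), target)
        for name, series in metrics.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data for one microservice: 8 sampling intervals.
degraded = [0, 0, 0, 0, 1, 1, 1, 1]  # assumed label: 1 = interval was degraded
metrics = {
    "latency_ms": [10, 12, 11, 13, 90, 95, 88, 92],  # tracks the degradation
    "cpu_pct": [50, 50, 50, 50, 50, 50, 50, 50],     # flat, uninformative
}

for name, score in rank_metrics(metrics, degraded):
    print(f"{name}: {score:.3f} bits")
```

In this toy setting the latency series separates healthy from degraded intervals and so ranks above the flat CPU series. A topology-aware system would additionally need to decide which services' metrics to compare, which is where the trace data described in the abstract would come in.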
