Modern cloud-native applications built on microservice architectures present unprecedented challenges for system monitoring and alerting. Site Reliability Engineers (SREs) must define effective monitoring strategies across a multitude of metrics to ensure system reliability, a task that traditionally requires extensive manual expertise. The distributed nature of microservices, characterized by stochastic execution patterns and intricate inter-service dependencies, makes manually navigating this vast metrics landscape computationally and operationally prohibitive. To address this challenge, we propose KIMetrix, a data-driven system that automatically identifies minimal yet comprehensive metric subsets to aid SREs in monitoring microservice applications. KIMetrix uses information-theoretic measures, specifically entropy and mutual information, to quantify metric criticality while accounting for the stochastic execution patterns inherent in microservice topologies. Our approach operates solely on lightweight metrics and traces, avoiding expensive processing of unstructured logs, and requires no expert-defined training data. Experimental evaluation on state-of-the-art real-world microservice benchmark datasets demonstrates that KIMetrix identifies critical metric subsets that provide comprehensive system coverage while significantly reducing the burden on SREs. By automating the identification of essential metrics for alerting, KIMetrix enables more reliable system monitoring that neither overwhelms operators with false positives nor misses critical system events.
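To make the role of entropy and mutual information more concrete, the following is a minimal sketch of how the two measures could be used to pick a small, non-redundant metric subset. It is an illustration under stated assumptions, not KIMetrix's actual algorithm: the equal-width binning, the greedy "entropy minus redundancy" score, and all function names (`discretize`, `select_metrics`, and so on) are hypothetical, and the sketch ignores traces and service topology entirely.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def discretize(values, bins=4):
    """Map numeric samples to equal-width bin indices (assumed preprocessing)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant metric carries no information
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def select_metrics(metrics, k):
    """Greedily pick up to k metrics, scoring each candidate by its entropy
    minus its largest mutual information with any already selected metric.
    This heuristic is an assumption made for illustration only."""
    symbols = {name: discretize(vals) for name, vals in metrics.items()}
    selected = []
    while len(selected) < min(k, len(symbols)):
        def score(name):
            redundancy = max(
                (mutual_information(symbols[name], symbols[s]) for s in selected),
                default=0.0,
            )
            return entropy(symbols[name]) - redundancy

        best = max((n for n in symbols if n not in selected), key=score)
        selected.append(best)
    return selected


# Toy data: "cpu_copy" duplicates "cpu", and "const" never changes.
metrics = {
    "cpu": [1, 2, 3, 4, 5, 6, 7, 8],
    "cpu_copy": [1, 2, 3, 4, 5, 6, 7, 8],
    "const": [5] * 8,
    "latency": [1, 8, 2, 7, 3, 6, 4, 5],
}
chosen = select_metrics(metrics, k=2)
print(chosen)
```

On this toy input the duplicate and the constant metric are skipped, because the duplicate is fully redundant with `cpu` and the constant has zero entropy. A real system would also need a principled discretization and a way to weight metrics by how often their service is actually exercised.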