Identifying Performance Issues in Cloud Service Systems Based on Relational-Temporal Features

Cloud systems are susceptible to performance issues, which may cause service-level agreement violations and financial losses. In current practice, crucial metrics are monitored periodically to provide insight into the operational status of components. Identifying performance issues is often formulated as an anomaly detection problem and tackled by analyzing each metric independently. However, this approach overlooks the complex dependencies among cloud components. Some graph neural network-based methods take both temporal and relational information into account, but they struggle to identify the correlation violations among metrics that indicate underlying performance issues. Furthermore, the large number of components in a cloud system produces a vast array of noisy metrics. This complexity makes it impractical for engineers to fully comprehend the correlations, and therefore hard to identify performance issues accurately. To address these limitations, we propose Identifying Performance Issues based on Relational-Temporal Features (ISOLATE), a learning-based approach that leverages both the relational and temporal features of metrics to identify performance issues. In particular, it adopts a graph neural network with attention to characterize the relations among metrics, and extracts long-term and multi-scale temporal patterns using a GRU and a convolution network, respectively. The learned graph attention weights can further be used to localize the correlation-violated metrics. Moreover, to mitigate the impact of noisy data, ISOLATE uses a positive-unlabeled learning strategy that assigns pseudo-labels based on a small portion of confirmed negative examples. Extensive evaluation on both public and industrial datasets shows that ISOLATE outperforms all baseline models, achieving an F1-score of 0.945 and a Hit rate@3 of 0.920.
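
The two reported figures can be made concrete with a small sketch. The abstract does not define either metric, so the code below assumes the common definitions: F1 as the harmonic mean of precision and recall over binary issue labels, and Hit rate@k as the fraction of cases whose true culprit metric appears among the top-k ranked candidates. The function names, the single-culprit-per-case assumption, and the toy data are illustrative and are not taken from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for binary labels, where 1 marks a performance issue."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def hit_rate_at_k(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 3
) -> float:
    """Fraction of cases whose true culprit is in the top-k candidates.

    Assumes exactly one ground-truth metric per case (an assumption of
    this sketch, not something the abstract states).
    """
    if not truth:
        return 0.0
    hits = sum(
        1 for candidates, culprit in zip(ranked, truth)
        if culprit in candidates[:k]
    )
    return hits / len(truth)


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
ranked_metrics = [
    ["cpu", "mem", "disk", "net"],
    ["net", "cpu", "mem", "disk"],
    ["disk", "net", "cpu", "mem"],
]
true_culprits = ["mem", "disk", "disk"]

detection_f1 = f1_score(y_true, y_pred)
localization_hr3 = hit_rate_at_k(ranked_metrics, true_culprits, k=3)
```

Under these assumed definitions, F1 scores the detection step (is there a performance issue?), while Hit rate@3 scores the localization step (is the violated metric among the top three candidates?).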
