Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose inferring functional clusters of instances, i.e., groups of instances with similar functionalities. We first conduct a pilot study on a large-scale cloud system, Huawei Cloud, which shows that instances with similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions instances into coarse-grained chunks based on their communication patterns. Within each chunk, Prism then groups instances with similar resource usage patterns to produce fine-grained functional clusters. This design reduces noise in the data and allows Prism to process a massive number of instances efficiently. We evaluate Prism on two datasets collected from the real-world production environment of Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95, surpassing existing state-of-the-art solutions. Additionally, we illustrate how Prism can be integrated into monitoring systems to enhance cloud reliability through two real-world use cases.
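The coarse-to-fine strategy can be illustrated with a minimal sketch. This is not Prism's actual algorithm, which the abstract does not specify. As stand-ins, the sketch assumes that coarse chunks are the connected components of a communication graph and that fine clusters are formed greedily by Pearson correlation of resource usage series. The function names, the `threshold` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def coarse_chunks(instances, edges):
    """Coarse step (assumed): connected components of the communication graph."""
    parent = {i: i for i in instances}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    chunks = defaultdict(list)
    for i in instances:
        chunks[find(i)].append(i)
    return list(chunks.values())


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def fine_clusters(chunk, usage, threshold=0.9):
    """Fine step (assumed): join the first cluster whose representative correlates strongly."""
    clusters = []
    for inst in chunk:
        for cluster in clusters:
            if pearson(usage[inst], usage[cluster[0]]) >= threshold:
                cluster.append(inst)
                break
        else:
            clusters.append([inst])
    return clusters


def coarse_to_fine(instances, edges, usage, threshold=0.9):
    """Run the fine step only inside each coarse chunk."""
    result = []
    for chunk in coarse_chunks(instances, edges):
        result.extend(fine_clusters(chunk, usage, threshold))
    return result


# Hypothetical toy data: which instances talk to each other, and a CPU-like series each.
instances = ["web-1", "web-2", "db-1", "db-2", "batch-1", "batch-2"]
edges = [("web-1", "db-1"), ("web-2", "db-2"), ("web-1", "web-2"), ("batch-1", "batch-2")]
usage = {
    "web-1": [1, 5, 9, 5, 1], "web-2": [2, 6, 10, 6, 2],
    "db-1": [9, 5, 1, 5, 9], "db-2": [8, 4, 0, 4, 8],
    "batch-1": [1, 1, 9, 1, 1], "batch-2": [2, 2, 10, 2, 2],
}
print(coarse_to_fine(instances, edges, usage))
```

Because the fine step only compares instances inside the same chunk, the number of pairwise comparisons shrinks, and instances that never communicate cannot be merged by a coincidentally similar usage curve. This is one way to read the abstract's claim about noise reduction and efficiency.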
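For readers unfamiliar with the evaluation metric: the v-measure is the harmonic mean of homogeneity (each predicted cluster contains members of a single true class) and completeness (all members of a true class land in the same cluster), both defined through conditional entropy. The sketch below is a plain-Python illustration of that standard definition with equal weighting. It is not the evaluation code used for Prism, and the label lists are made up.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) computed from joint and marginal label counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_truth == 0 else 1 - conditional_entropy(truth, pred) / h_truth
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(pred, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical labels: the score ignores how cluster ids are named.
truth = ["web", "web", "db", "db"]
pred = [1, 1, 0, 0]
print(v_measure(truth, pred))
```

A score near 1 means the predicted clusters are both pure and complete with respect to the ground-truth functional groups, while lumping everything into one cluster scores 0 whenever the ground truth has more than one class.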