Distributed systems in general, and cloud systems in particular, are susceptible to failures that can lead to substantial economic and data losses, security breaches, and even threats to human safety. Software ageing is one such vulnerability. It emerges from the routine reuse of computational system units, which induces fatigue within the components and results in an increased failure rate and potential system breakdown. Because of its stochastic nature, ageing cannot be measured directly; instead, ageing indicators are used as proxies. While there are dozens of studies on different ageing indicators, a comprehensive comparison of them in different settings remains underexplored. In this paper, we compare two ageing indicators using OpenStack as a use case. Specifically, our evaluation compares memory usage (including swap memory) and request response time, both readily available indicators. By running multiple OpenStack deployments with varying configurations, we conduct a series of experiments and analyse the ageing indicators. Comparative analysis through statistical tests provides insights into the strengths and weaknesses of the utilised ageing indicators. Finally, through an in-depth analysis of other OpenStack failures, we identify underlying failure patterns and their impact on the studied ageing indicators.