Formal Analysis of Metastable Failures in Software Systems

Many large-scale software systems exhibit metastable failures. In this class of failures, a stressor such as a temporary spike in workload causes system performance to drop, and performance then remains low even after the stressor is removed. Such failures have been reported by many large corporations and are considered a rare but catastrophic source of availability outages in cloud systems. In this paper, we provide the mathematical foundations of metastability in request-response server systems. We model such systems using a domain-specific language. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs through modeling and data-driven calibration. We use the structure of the CTMC models to visualize the qualitative behavior of the model. This visualization is a surprisingly effective way to identify parameterizations under which a system shows metastable behavior. We complement the qualitative analysis with quantitative predictions. We provide a formal notion of metastable behavior based on escape probabilities, and show that metastable behaviors are related to the eigenvalue structure of the CTMC. Our characterization leads to algorithmic tools for predicting recovery times in metastable models of server systems. We have implemented our technique in a tool for the modeling and analysis of server systems. Through models inspired by failures in real request-response systems, we show that our qualitative visual analysis, which runs in a matter of milliseconds, captures and predicts many instances of metastability that were observed in the field. Our algorithms confirm that recovery times surge as the system parameters approach metastable modes in the dynamics.
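The abstract does not show the domain-specific language or the tool, so the following is only a minimal sketch of the kind of computation it describes, not the authors' implementation. It assumes a hypothetical single-server queue in which requests beyond a backlog threshold `K` trigger retries at rate `rho` each, builds the generator matrix of the resulting birth-death CTMC, and computes two standard quantities: the expected time to return to a healthy backlog from a full queue, and the eigenvalues of the generator. All names and parameter values (`lam`, `mu`, `rho`, `K`, `N`, `healthy_max`) are illustrative assumptions.

```python
import numpy as np


def build_generator(lam, mu, rho, K, N):
    """Generator matrix of a birth-death CTMC over backlog sizes 0..N.

    Assumed toy model: base arrivals at rate lam, service at rate mu,
    and every request beyond backlog K adds retry traffic at rate rho.
    """
    Q = np.zeros((N + 1, N + 1))
    for n in range(N + 1):
        if n < N:
            Q[n, n + 1] = lam + rho * max(0, n - K)  # arrivals plus retries
        if n > 0:
            Q[n, n - 1] = mu  # one request completes
        Q[n, n] = -Q[n].sum()  # each row of a generator sums to zero
    return Q


def expected_recovery_time(Q, healthy_max):
    """Expected time to reach a backlog <= healthy_max from the full queue.

    Solves the standard hitting-time system Q_BB t = -1 restricted to the
    unhealthy states B = {healthy_max + 1, ..., N}.
    """
    bad = np.arange(healthy_max + 1, Q.shape[0])
    Q_bb = Q[np.ix_(bad, bad)]
    t = np.linalg.solve(Q_bb, -np.ones(len(bad)))
    return t[-1]  # last entry corresponds to the full queue, state N


def relaxation_rates(Q):
    """Eigenvalues of the generator, sorted from largest to smallest.

    The largest is 0 (stationary mode); the next one sets the slowest
    relaxation timescale of the chain.
    """
    return np.sort(np.linalg.eigvals(Q).real)[::-1]


if __name__ == "__main__":
    for rho in (0.0, 0.05, 0.1):
        Q = build_generator(lam=0.8, mu=1.0, rho=rho, K=5, N=20)
        t_rec = expected_recovery_time(Q, healthy_max=2)
        slowest = relaxation_rates(Q)[1]
        print(f"rho={rho:.2f}  recovery time={t_rec:.1f}  slowest rate={slowest:.4f}")
```

In this toy model, raising the retry rate makes the high-backlog region self-sustaining, so the expected recovery time grows with `rho`. This mirrors, in a much simplified form, the abstract's statement that recovery times surge as parameters approach metastable modes.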

