Many large-scale software systems exhibit metastable failures. In this class of failures, a stressor such as a temporary spike in workload causes system performance to drop, and performance then remains low even after the stressor is removed. These failures have been reported by many large corporations and are considered a rare but catastrophic source of availability outages in cloud systems. In this paper, we provide the mathematical foundations of metastability in request-response server systems. We model such systems using a domain-specific language. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs through modeling and data-driven calibration. We use the structure of the CTMC models to visualize the qualitative behavior of the model. The visualization is a surprisingly effective way to identify parameterizations under which a system shows metastable behaviors. We complement the qualitative analysis with quantitative predictions. We provide a formal notion of metastable behaviors based on escape probabilities, and show that metastable behaviors are related to the eigenvalue structure of the CTMC. Our characterization leads to algorithmic tools that predict recovery times in metastable models of server systems. We have implemented our technique in a tool for the modeling and analysis of server systems. Through models inspired by failures in real request-response systems, we show that our qualitative visual analysis captures and predicts, in a matter of milliseconds, many instances of metastability that were observed in the field. Our algorithms confirm that recovery times surge as the system parameters approach metastable modes in the dynamics.
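To make the link between CTMC structure and recovery time concrete, the following is a minimal sketch and not the paper's DSL, tool, or algorithm. It assumes a hypothetical birth-death CTMC over queue lengths in which retries add load once the queue exceeds a threshold; the function names, the retry model, and all rate values are illustrative assumptions. It uses the mean hitting time of the empty-queue state as a stand-in for recovery time, and the spectral gap of the generator as a simple eigenvalue-based indicator.

```python
import numpy as np

def build_generator(n_states, arrival, service, retry, threshold):
    """Generator matrix Q of a birth-death CTMC over queue lengths 0..n_states-1.

    Assumed (hypothetical) retry model: when the queue length exceeds
    `threshold`, timed-out requests are retried and add `retry` to the
    arrival rate.
    """
    Q = np.zeros((n_states, n_states))
    for q in range(n_states):
        up = arrival + (retry if q > threshold else 0.0)
        if q + 1 < n_states:
            Q[q, q + 1] = up          # one more request queued
        if q > 0:
            Q[q, q - 1] = service     # one request completed
        Q[q, q] = -Q[q].sum()         # rows of a generator sum to zero
    return Q

def expected_recovery_time(Q, start, target):
    """Mean time to first reach `target` from `start`.

    Solves the standard hitting-time equations: h[target] = 0 and
    sum_j Q[i, j] * h[j] = -1 for every other state i.
    """
    n = Q.shape[0]
    keep = [i for i in range(n) if i != target]
    A = Q[np.ix_(keep, keep)]
    h = np.linalg.solve(A, -np.ones(len(keep)))
    return h[keep.index(start)]

def spectral_gap(Q):
    """Magnitude of the nonzero eigenvalue of Q closest to zero (real part)."""
    eig = np.sort(np.linalg.eigvals(Q).real)[::-1]  # eig[0] is approximately 0
    return -eig[1]

# Illustrative parameters only (not calibrated to any real system).
N, LAM, MU, THRESH = 30, 0.5, 1.0, 10
Q_plain = build_generator(N, LAM, MU, retry=0.0, threshold=THRESH)
Q_retry = build_generator(N, LAM, MU, retry=1.0, threshold=THRESH)

for name, Q in [("no retries", Q_plain), ("with retries", Q_retry)]:
    t = expected_recovery_time(Q, start=N - 1, target=0)
    print(f"{name}: mean recovery time = {t:.1f}, spectral gap = {spectral_gap(Q):.2e}")
```

In this toy model, the retry feedback makes the overloaded region self-sustaining: the mean time to drain from a full queue grows sharply, and the spectral gap shrinks toward zero. This is the kind of slow escape that the abstract associates with metastable behavior, shown here only under the stated assumptions.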