Metastable failure is a recent abstraction of a failure pattern that occurs frequently in real-world distributed storage systems. In this paper, we present a formal analysis and model of metastable failures in replicated storage systems. We focus on a foundational problem in distributed systems, the problem of consensus, so that our results apply to a large class of systems. Our main contribution is a queuing-based analytical model, MSF-Model, that can be used to characterize and predict metastable failures. MSF-Model integrates novel modeling concepts that make it possible to model metastable failures, which was intractable prior to our work. We also perform real experiments to reproduce these failures and validate our model. Comparing the experimental results with the predictions of the queuing-based model shows that MSF-Model predicts metastable failures with high accuracy.