Metastable failures are hard to detect, prevent, and mitigate. During a metastable failure, a system exhibits self-sustaining bad behavior even in the absence of adversarial conditions. Prior work has focused on symptoms, portraying metastable failures as instances of self-sustaining overload. This characterization leaves the underlying causes and dynamics of these failures unknown, and it does not account for metastable failures that do not manifest as overload. We present the first causal characterization of metastable failures by identifying their origin in metastable faults, i.e., structural destabilizing cycles of interaction among system components that, in isolation, are stabilizing. Metastable failures arise when scheduling decisions let these destabilizing interactions gain the upper hand over the individual components' stabilizing tendencies. We then derive a methodology to predict metastable failures and to build metastable-fault-tolerant (MFT) systems. We apply our methodology to three case studies, showcasing the generality of our results.