Metastable failures are hard to detect, prevent, and mitigate. During a metastable failure, a system exhibits self-sustaining bad behavior even in the absence of adversarial conditions. Prior work focuses on symptoms and portrays metastable failures as instances of self-sustaining overload. This characterization leaves the underlying causes and dynamics of these failures unexplained, and does not account for metastable failures that do not manifest as overload. We present the first causal characterization of metastable failures by identifying their origin in metastable faults, i.e., structural destabilizing cycles of interaction among system components that, in isolation, are stabilizing. Metastable failures arise when scheduling decisions let these destabilizing interactions gain the upper hand over the individual components' stabilizing tendencies. We then derive a methodology to predict metastable failures and to build metastable-fault-tolerant (MFT) systems. We apply our methodology to three case studies, showcasing the generality of our results.