Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.
翻译:故障透明性使用户能够在更高的抽象层次上推理分布式系统,其中复杂的故障处理逻辑被隐藏。这对于作为众多云应用骨干的状态化数据流系统尤为关键。具体而言,本文聚焦于证明流行状态化数据流系统 Apache Flink 的故障透明性。尽管故障透明性是 Apache Flink 的一个关键方面,但迄今为止尚未得到形式化证明。然而,由于机制本身的复杂性,证明故障透明机制的正确性具有挑战性。尽管如此,这种复杂性可以有效地隐藏在故障透明的编程接口之后。为了证明 Apache Flink 具有故障透明性,我们采用小步操作语义对其进行建模。接着,我们基于观测可解释性提出了一种新颖的故障透明性定义,该概念根据执行过程的观测结果来关联它们。最后,我们为实现模型提供了故障透明性的形式化证明;即,我们证明了无故障模型能够正确地从实现模型的故障相关细节中抽象出来。我们还在公平执行假设下证明了实现模型的活性。这些成果是迈向构建可验证状态化数据流系统栈的第一步。