Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.
翻译:故障透明性使用户能够在更高的抽象层次上推理分布式系统,其中复杂的故障处理逻辑被隐藏。这对于作为许多云应用骨干的状态化数据流系统尤其如此。具体而言,本文聚焦于证明流行状态化数据流系统Apache Flink中的故障透明性。尽管故障透明性是Apache Flink的一个关键方面,但迄今为止尚未得到正式证明。然而,由于机制本身的复杂性,证明故障透明机制的正确性具有挑战性。尽管如此,这种复杂性可以有效地隐藏在故障透明的编程接口之后。为了证明Apache Flink具有故障透明性,我们使用小步操作语义对其进行建模。接着,我们基于观测可解释性——一个根据执行过程的观测结果来关联执行的概念——提出了一个新颖的故障透明性定义。最后,我们为其实施模型提供了故障透明性的形式化证明;即,我们证明了无故障模型能够正确地抽象掉实施模型中与故障相关的细节。此外,在公平执行假设下,我们还证明了实施模型的活性。这些结果为构建可验证的状态化数据流系统栈迈出了第一步。