A case study on parametric verification of failure detectors

Partial synchrony is a model of computation in many distributed algorithms and modern blockchains. These algorithms are typically parameterized in the number of participants, and their correctness requires the existence of bounds on message delays and on the relative speed of processes after reaching Global Stabilization Time. These characteristics make partially synchronous algorithms parameterized in the number of processes, and parametric in time bounds, which render automated verification of partially synchronous algorithms challenging. In this paper, we present a case study on formal verification of both safety and liveness of the Chandra and Toueg failure detector that is based on partial synchrony. To this end, we first introduce and formalize the class of symmetric point-to-point algorithms that contains the failure detector. Second, we show that these symmetric point-to-point algorithms have a cutoff, and the cutoff results hold in three models of computation: synchrony, asynchrony, and partial synchrony. As a result, one can verify them by model checking small instances, but the verification problem stays parametric in time. Next, we specify the failure detector and the partial synchrony assumptions in three frameworks: TLA+, IVy, and counter automata. Importantly, we tune our modeling to use the strength of each method: (1) We are using counters to encode message buffers with counter automata, (2) we are using first-order relations to encode message buffers in IVy, and (3) we are using both approaches in TLA+. By running the tools for TLA+ and counter automata, we demonstrate safety for fixed time bounds. By running IVy, we prove safety for arbitrary time bounds. Moreover, we show how to verify liveness of the failure detector by reducing the verification problem to safety verification. Thus, both properties are verified by developing inductive invariants with IVy.

翻译：部分同步是许多分布式算法和现代区块链中的计算模型。这类算法通常以参与者数量为参数化，其正确性要求存在消息延迟上限以及达到全局稳定时间后进程相对速度的上界。这些特性使得部分同步算法在进程数量和时间界限上均具有参数化特性，这给部分同步算法的自动化验证带来了挑战。本文以Chandra和Toueg基于部分同步的故障检测器为例，对其安全性和活性进行形式化验证案例研究。为此，我们首先引入并形式化包含该故障检测器的对称点对点算法类别。其次，证明此类对称点对点算法存在截断性，且截断结果在同步、异步和部分同步三种计算模型下均成立。因此，可通过模型检测小规模实例进行验证，但验证问题仍保持时间参数化特性。接下来，我们在TLA+、IVy和计数器自动机三种框架中规范定义故障检测器及部分同步假设。重要的是，我们针对每种方法的优势进行建模优化：(1) 利用计数器对计数器自动机中的消息缓冲区进行编码；(2) 在IVy中采用一阶关系对消息缓冲区进行编码；(3) 在TLA+中同时使用两种方法。通过运行TLA+和计数器自动机工具，我们证明了固定时间边界下的安全性。通过运行IVy，我们证明了任意时间边界下的安全性。此外，我们展示了如何通过将验证问题归约到安全性验证来验证故障检测器的活性。因此，借助IVy开发归纳不变式，成功验证了这两个属性。