A case study on parametric verification of failure detectors

Partial synchrony is a model of computation in many distributed algorithms and modern blockchains. These algorithms are typically parameterized in the number of participants, and their correctness requires the existence of bounds on message delays and on the relative speed of processes after reaching Global Stabilization Time. These characteristics make partially synchronous algorithms parameterized in the number of processes, and parametric in time bounds, which render automated verification of partially synchronous algorithms challenging. In this paper, we present a case study on formal verification of both safety and liveness of the Chandra and Toueg failure detector that is based on partial synchrony. To this end, we first introduce and formalize the class of symmetric point-to-point algorithms that contains the failure detector. Second, we show that these symmetric point-to-point algorithms have a cutoff, and the cutoff results hold in three models of computation: synchrony, asynchrony, and partial synchrony. As a result, one can verify them by model checking small instances, but the verification problem stays parametric in time. Next, we specify the failure detector and the partial synchrony assumptions in three frameworks: TLA+, IVy, and counter automata. Importantly, we tune our modeling to use the strength of each method: (1) We are using counters to encode message buffers with counter automata, (2) we are using first-order relations to encode message buffers in IVy, and (3) we are using both approaches in TLA+. By running the tools for TLA+ and counter automata, we demonstrate safety for fixed time bounds. By running IVy, we prove safety for arbitrary time bounds. Moreover, we show how to verify liveness of the failure detector by reducing the verification problem to safety verification. Thus, both properties are verified by developing inductive invariants with IVy.

翻译：部分同步性是许多分布式算法和现代区块链中采用的一种计算模型。这类算法通常以参与者数量为参数，其正确性要求消息延迟及进程在达到全局稳定时间后的相对速度存在界限。这些特性使得部分同步算法在进程数量和时间界限上均呈现参数化特征，给此类算法的自动化验证带来挑战。本文以基于部分同步性的Chandra-Toueg故障检测器为研究对象，对其安全性和活性的形式化验证进行案例研究。为此，我们首先引入并形式化包含该故障检测器的对称点对点算法类。其次，证明这类对称点对点算法具有截断性质，且该截断结果在同步、异步和部分同步三种计算模型下均成立。基于此，可通过模型检测小规模实例进行验证，但验证问题仍保持时间参数化特性。随后，我们在TLA+、IVy和计数器自动机三种框架中对故障检测器及部分同步性假设进行规约。关键在于，我们通过模型调优以充分利用各方法的优势：（1）使用计数器对计数器自动机中的消息缓冲区进行编码，（2）在IVy中采用一阶关系编码消息缓冲区，（3）在TLA+中综合运用两种方法。通过运行TLA+和计数器自动机工具，我们在固定时间界限下验证了安全性；通过运行IVy，我们在任意时间界限下证明了安全性。此外，我们展示了通过将活性验证问题归约为安全性验证来验证故障检测器活性的方法。最终，通过使用IVy开发归纳不变量，完成了对两类性质的形式化验证。