Reliable distributed systems require replication and consensus among distributed processes to tolerate process and communication failures. Understanding and assuring the correctness of protocols for replication and consensus have been a significant challenge. This paper describes the precise specification and runtime checking of Derecho, a more recent, sophisticated protocol for fast replication and consensus for cloud services. A precise specification must fill in missing details and resolve ambiguities in English and pseudocode algorithm descriptions while also faithfully following the descriptions. To help check the correctness of the protocol, we also performed careful manual analysis and increasingly systematic runtime checking. We obtain a complete specification that is directly executable, and we discover and fix a number of issues in the pseudocode. These results were facilitated by the already detailed pseudocode of Derecho and made possible by using DistAlgo, a language that allows distributed algorithms to be easily and clearly expressed and directly executed.
翻译:可靠的分布式系统需要分布式进程之间的复制与共识机制,以应对进程和通信故障。理解和确保复制及共识协议的正确性一直是一项重大挑战。本文描述了Derecho——一种较新的、面向云服务快速复制与共识的复杂协议——的精确规范与运行时检查。精确的规范需要填补现有英文及伪代码算法描述中的遗漏细节并消除歧义,同时忠实遵循原有描述。为协助验证协议正确性,我们还进行了细致的人工分析以及渐进式系统化的运行时检查。我们获得了一个可直接执行的完整规范,并发现并修复了伪代码中的若干问题。这些成果得益于Derecho已有的详尽伪代码,并通过使用DistAlgo(一种能轻松清晰表达分布式算法并直接执行的语言)得以实现。