Large-scale, fault-tolerant distributed systems are the backbone of many critical software services. Since they must execute correctly in a possibly adversarial environment with arbitrary communication delays and failures, the underlying algorithms are intricate. In particular, achieving consistency and data retention relies on complex consensus (state machine replication) protocols. Ensuring the reliability of implementations of such protocols remains a significant challenge because of the enormous number of exceptional conditions that may arise in production. We propose a methodology and a tool called Netrix for testing such implementations. Netrix aims to exploit programmers' knowledge to improve coverage, enables robust bug reproduction, and can be used for regression testing across different versions of an implementation. As an evaluation, we apply our tool to three systems: Tendermint, a popular proof-of-stake blockchain protocol that relies on a Byzantine consensus algorithm; Raft, a benign consensus algorithm; and BFT-Smart. We identified four deviations of the Tendermint implementation from the protocol specification and confirmed their absence in an updated implementation. Additionally, we reproduced four previously known bugs in Raft.