Large-scale, fault-tolerant distributed systems are the backbone of many critical software services. Because they must execute correctly in a possibly adversarial environment with arbitrary communication delays and failures, the underlying algorithms are intricate. In particular, achieving consistency and data retention relies on complex consensus (state machine replication) protocols. Ensuring the reliability of implementations of such protocols remains a significant challenge because of the enormous number of exceptional conditions that may arise in production. We propose a methodology and a tool called Netrix for testing such implementations. Netrix aims to exploit programmers' knowledge to improve coverage, enables robust bug reproduction, and can be used for regression testing across different versions of an implementation. As an evaluation, we apply our tool to three systems: Tendermint, a popular proof-of-stake blockchain protocol that relies on a Byzantine consensus algorithm; Raft, a benign consensus algorithm; and BFT-Smart. We were able to identify deviations of the Tendermint implementation from the protocol specification and to verify corrections on an updated implementation. Additionally, we were able to reproduce previously known bugs in Raft.