ZooKeeper is a coordination service, widely used as a backbone of various distributed systems. Though its reliability is of critical importance, testing is insufficient for an industrial-strength system of the size and complexity of ZooKeeper, and deep bugs can still be found. To this end, we resort to formal TLA+ specifications to further improve the reliability of ZooKeeper. Our primary objective is usability and automation, rather than full verification. We incrementally develop three levels of specifications for ZooKeeper. We first obtain the protocol specification, which unambiguously specify the Zab protocol behind ZooKeeper. We then proceed to a finer grain and obtain the system specification, which serves as the super-doc for system development. In order to further leverage the model-level specification to improve the reliability of the code-level implementation, we develop the test specification, which guides the explorative testing of the ZooKeeper implementation. The formal specifications help eliminate the ambiguities in the protocol design and provide comprehensive system documentation. They also help find new critical deep bugs in system implementation, which are beyond the reach of state-of-the-art testing techniques.
翻译:ZooKeeper是一种协调服务,广泛用作各种分布式系统的骨干。尽管其可靠性至关重要,但对于像ZooKeeper这样规模和复杂度的工业级系统,测试仍显不足,深层缺陷仍可能被发现。为此,我们借助形式化的TLA+规范进一步提升ZooKeeper的可靠性。我们的主要目标是可用性和自动化,而非完全验证。我们逐步为ZooKeeper开发了三个层次的规范。首先获得协议规范,它明确规定了ZooKeeper背后的Zab协议。然后,我们细化至系统规范,作为系统开发的超级文档。为了进一步利用模型级规范提升代码级实现的可靠性,我们开发了测试规范,指导ZooKeeper实现的探索性测试。这些形式化规范有助于消除协议设计中的歧义,并提供全面的系统文档。它们还帮助发现系统实现中新的关键深层缺陷,这些缺陷超出了现有最先进测试技术的覆盖面。