This paper presents our experience specifying and verifying the correctness of ZooKeeper, a complex and evolving distributed coordination system. We use TLA+ to model fine-grained behaviors of ZooKeeper and use the TLC model checker to verify its correctness properties; we also check conformance between the model and code. The fundamental challenge is to balance the granularity of specifications and the scalability of model checking -- fine-grained specifications lead to state-space explosion, while coarse-grained specifications introduce model-code gaps. To address this challenge, we write specifications with different granularities for composable modules, and compose them into mixed-grained specifications based on specific scenarios. For example, to verify code changes, we compose fine-grained specifications of changed modules and coarse-grained specifications that abstract away details of unchanged code with preserved interactions. We show that writing multi-grained specifications is a viable practice and can cope with model-code gaps without untenable state space, especially for evolving software where changes are typically local and incremental. We detected six severe bugs that violate five types of invariants and verified their code fixes; the fixes have been merged to ZooKeeper. We also improve the protocol design to make it easy to implement correctly.
翻译:本文介绍了我们在规约与验证ZooKeeper这一复杂且持续演化的分布式协调系统的正确性方面的工作经验。我们使用TLA+对ZooKeeper的细粒度行为进行建模,并利用TLC模型检测器验证其正确性属性;同时我们还检查了模型与代码之间的一致性。核心挑战在于平衡规约的粒度与模型检测的可扩展性——细粒度规约会引发状态空间爆炸,而粗粒度规约则会导致模型与代码之间的鸿沟。为解决这一挑战,我们为可组合的模块编写了不同粒度的规约,并根据具体场景将它们组合成混合粒度的规约。例如,为验证代码变更,我们将变更模块的细粒度规约与抽象了未变更代码细节但保留其交互行为的粗粒度规约进行组合。我们证明了编写多粒度规约是一种可行的实践方法,能够在避免状态空间爆炸的同时应对模型与代码之间的鸿沟,尤其适用于变更通常具有局部性和增量性的演化中软件。我们检测出六个违反五类不变量的严重缺陷,并验证了其代码修复方案;这些修复已合并至ZooKeeper主分支。此外,我们还改进了协议设计,使其更易于正确实现。