Modern distributed systems rely on consensus protocols to provide a fault-tolerant core on which applications are built. Consensus protocols are correct under a specific failure model in which at most $f$ machines can fail. We argue that this $f$-threshold failure model oversimplifies the real world and forgoes opportunities to optimize for cost or performance. We argue instead for a probabilistic failure model that captures the complex and nuanced nature of faults observed in practice. Probabilistic consensus protocols can explicitly leverage individual machine \textit{failure curves} and explore sidestepping traditional bottlenecks such as majority quorum intersection, enabling systems that are more reliable, efficient, cost-effective, and sustainable.
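
To make the contrast with the $f$-threshold model concrete, the following is a minimal sketch, not a method taken from this work. It assumes each machine fails independently with its own probability over some fixed window, and computes the chance that more than $f$ machines fail. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import List


def failure_count_distribution(fail_probs: List[float]) -> List[float]:
    """Return dist where dist[k] is the probability that exactly k machines fail.

    Assumption (not from the source): machines fail independently, and
    fail_probs[i] is machine i's failure probability over the window of interest.
    """
    dist = [1.0]  # with zero machines considered, zero failures is certain
    for p in fail_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this machine survives
            nxt[k + 1] += mass * p      # this machine fails
        dist = nxt
    return dist


def prob_threshold_exceeded(fail_probs: List[float], f: int) -> float:
    """Probability that strictly more than f machines fail."""
    dist = failure_count_distribution(fail_probs)
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Two hypothetical 3-machine clusters, both configured to tolerate f = 1.
    uniform = [0.01, 0.01, 0.01]
    mixed = [0.001, 0.001, 0.1]
    print(prob_threshold_exceeded(uniform, 1))
    print(prob_threshold_exceeded(mixed, 1))
```

Both hypothetical clusters look identical under the $f$-threshold model ($n = 3$, $f = 1$), yet the sketch assigns them different probabilities of exceeding the threshold. That difference is the kind of information a probabilistic failure model exposes.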