We study implementations of basic fault-tolerant primitives, such as consensus and registers, in message-passing systems subject to process crashes and a broad range of communication failures. Our results characterize the necessary and sufficient conditions for implementing these primitives as a function of the connectivity constraints and synchrony assumptions. Our main contribution is a new algorithm for partially synchronous consensus that is resilient to process crashes and channel failures and is optimal in its connectivity requirements. In contrast to prior work, our algorithm assumes the most general model of message loss where faulty channels are flaky, i.e., can lose messages without any guarantee of fairness. This failure model is particularly challenging for consensus algorithms, as it rules out standard solutions based on leader oracles and failure detectors. To circumvent this limitation, we construct our solution using a new variant of the recently proposed view synchronizer abstraction, which we adapt to the crash-prone setting with flaky channels.
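To illustrate why flaky channels are harder than channels with fair loss, the following is a minimal sketch and not the algorithm from this work. It assumes the standard notion of a fair-lossy channel, where a message sent infinitely often is eventually delivered, and contrasts it with a flaky channel that may drop every message. The names `Channel`, `retransmit`, `fair_lossy`, and `flaky`, as well as the specific drop policies, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


class Channel:
    """Unidirectional channel whose losses are decided by a drop policy."""

    def __init__(self, drops: Callable[[int], bool]) -> None:
        # drops(i) returns True if the i-th send (0-indexed) is lost.
        self.drops = drops
        self.sent = 0
        self.delivered: List[str] = []

    def send(self, msg: str) -> None:
        lost = self.drops(self.sent)
        self.sent += 1
        if not lost:
            self.delivered.append(msg)


def retransmit(channel: Channel, msg: str, attempts: int) -> bool:
    """Resend msg `attempts` times; report whether any copy arrived."""
    for _ in range(attempts):
        channel.send(msg)
    return msg in channel.delivered


# Assumed fair-lossy behavior: most sends are lost, but every 5th gets through,
# so repeated sending eventually succeeds.
fair_lossy = Channel(lambda i: i % 5 != 4)

# One admissible flaky behavior: every send is lost, so no number of
# retransmissions helps. A flaky channel may also deliver some messages;
# the point is that it gives no guarantee.
flaky = Channel(lambda i: True)
```

Under these assumptions, retransmission masks loss on the fair-lossy channel but not on the flaky one, which is why a protocol cannot rely on any particular flaky channel ever delivering a message.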