We develop deterministic algorithms for consensus, gossiping, and checkpointing in distributed systems whose nodes are prone to failures. The systems are modeled as synchronous complete networks. Failures are either crashes or authenticated Byzantine faults. The algorithmic goal is to achieve both linear running time and a linear amount of communication for as large an upper bound $t$ on the number of faults as possible, relative to the number of nodes~$n$. For crash failures, this is achieved for $t=\mathcal{O}(\frac{n}{\log n})$ in the case of consensus and for $t=\mathcal{O}(\frac{n}{\log^2 n})$ in the case of gossiping and checkpointing, while the running time of each algorithm is $\Theta(t+\log n)$. For authenticated Byzantine failures, we show how to achieve both linear running time and linear communication for $t=\mathcal{O}(\sqrt{n})$. We also show how to implement the algorithms in the single-port model, in which a node may choose only one other node to send a message to or receive a message from in a round, so as to preserve the range of running-time and communication optimality. We prove lower bounds showing that some of these performance bounds are optimal.