Widely deployed consensus protocols in the cloud are often leader-based and optimized for low latency under synchronous network conditions. However, cloud networks can experience disruptions such as network partitions, high-loss links, and configuration errors. These disruptions interfere with leader-based protocols because their view change mechanisms interrupt normal-case replication and cause the system to stall. This paper proposes RACS, a novel randomized consensus protocol that is robust to adversarial network conditions. RACS achieves the optimal latency of one round trip under synchronous network conditions while remaining resilient to adversarial network conditions. RACS follows a simple design inspired by Raft, the most widely used consensus protocol in the cloud, and therefore integrates seamlessly with the existing cloud software stack, a goal no previous asynchronous protocol has achieved. Experiments with a prototype deployed on Amazon EC2 show that RACS achieves a throughput of 28k cmd/sec under adversarial cloud network conditions, whereas existing leader-based protocols such as Multi-Paxos and Raft provide less than 2.8k cmd/sec. Under synchronous network conditions, RACS matches the performance of Multi-Paxos and Raft, achieving a throughput of 200k cmd/sec with a latency of 300 ms, which indicates that RACS introduces no unnecessary overhead. Finally, SADL-RACS, an optimized version of RACS designed for high performance and robustness, achieves a throughput of 500k cmd/sec under synchronous network conditions and 196k cmd/sec under adversarial network conditions.