Today's massive AI computation loads push heavy data synchronization across sites, i.e., nodes in data centers. Any reduction in the resulting consensus latency can significantly improve overall system performance. This consensus challenge is most severe across cross-domain sites. In this paper, we propose CD-Raft, an optimized Raft protocol for strong consistency across cross-domain sites, to address the cross-domain latency challenge. CD-Raft significantly reduces consensus latency by optimizing the cross-domain round-trip time (RTT) for reads and writes and by carefully positioning the leader node. We verify the correctness of CD-Raft with a formal TLA+ specification, which guarantees strong consistency across sites. We have prototyped CD-Raft and evaluated it using the YCSB benchmark. Empirical results show that, compared to classic Raft, CD-Raft reduces average latency by 32.90% and tail (99th percentile) latency by 49.24% for well-known traces across multiple sites.