We propose uBFT, the first State-Machine Replication (SMR) system to achieve microsecond-scale latency in data centers while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential because pure crash failures appear to be an illusion: real-life systems reportedly fail in many unexpected ways. uBFT relies on a small, non-tailored trusted computing base (disaggregated memory) and consumes a practically bounded amount of memory, both local and disaggregated. uBFT is based on a novel abstraction called Consistent Tail Broadcast, which we use to prevent equivocation while bounding memory. We implement uBFT using RDMA-based disaggregated memory and obtain an end-to-end latency of as little as 10 $\mu$s. This is at least 50$\times$ faster than MinBFT, a state-of-the-art $2f{+}1$ BFT SMR system based on Intel's SGX. We use uBFT to replicate two key-value stores (Memcached and Redis), as well as a financial order matching engine (Liquibook). These applications have low latency (up to 20 $\mu$s) and become Byzantine-tolerant at a cost of as little as 10 $\mu$s of additional latency. The price for uBFT is a small amount of reliable disaggregated memory (less than 1 MiB), which in our prototype consists of a small number of memory servers connected through RDMA and replicated for fault tolerance.