State Machine Replication (SMR) protocols form the backbone of many distributed systems. Enterprises and startups increasingly build their distributed systems on the cloud because of advantages such as scalability and cost-effectiveness. One of the first technical questions companies face when building a system on the cloud is which programming language to use. Among the many factors in this decision is whether to use a language with garbage collection (GC), such as Java or Go, or a language with manual memory management, such as C++ or Rust. Today, companies predominantly prefer languages with GC, such as Go, Kotlin, or even Python, for their ease of development. There is no free lunch, however: GC costs resources (memory and CPU) and performance (long tail latencies due to GC pauses). Although there have been anecdotal reports of reduced cloud cost and improved tail latencies after switching from a language with GC to a language with manual memory management, there has so far been no systematic study of the GC overhead of running an SMR-based cloud system. This paper studies the overhead of running an SMR-based cloud system written in a language with GC. To this end, we design from scratch a canonical SMR system, a MultiPaxos-based replicated in-memory key-value store, and implement it in C++, Java, Rust, and Go. We compare the performance and resource usage of these implementations when running on the cloud under different workloads and resource constraints and report our results. Our findings have implications for the design of cloud systems.