Many systems and services rely on timing assumptions to preserve performance and availability in critical aspects of their operation, such as timeouts for failure detectors or optimizations to concurrency control mechanisms. Many of these assumptions depend on components communicating on time: a delayed message may trigger the failure detector or push the system into a less-optimized execution mode. Unfortunately, timing assumptions are often set with little regard for the actual communication guarantees of the underlying infrastructure, in particular the variability of communication delays between processes on different nodes or servers. This variability is especially pronounced for systems deployed in the public cloud, since the cloud is a utility shared by many users and organizations and is therefore prone to higher performance variance caused by noisy neighbors. In this work, we present Cloud Latency Tester (CLT), a simple tool that measures the variability of communication delays between nodes to help engineers set proper values for their timing assumptions. We also provide an observational analysis of running CLT in three major cloud providers and share the lessons we learned.