This paper presents our experience in understanding latency variance caused by kernel and hardware events, which are often invisible at the application level. For this purpose, we have built VarMRI, a tool chain that monitors and analyzes those events over the long term. To mitigate the "big data" problem caused by long-term monitoring, VarMRI selectively records a subset of events following two principles: it records only events that affect the requests recorded by the application, and it records coarse-grained information first, collecting additional information only when necessary. Furthermore, VarMRI introduces an analysis method that is efficient on large amounts of data, robust across different data sets and to missing data, and informative to the user. VarMRI has helped us carry out a 3,000-hour study of six applications and benchmarks on CloudLab. The study reveals a wide variety of events causing latency variance, including interrupt preemption, Java GC, pipeline stalls, and NUMA balancing; simple optimization or tuning can reduce tail latencies by up to 31%. Furthermore, the impact of some of these events varies significantly across experiments, which confirms the necessity of long-term monitoring.
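As a rough illustration of the first recording principle (keep only events that affect requests recorded by the application), the sketch below filters a list of event intervals down to those that overlap at least one request interval in time. This is a minimal sketch under stated assumptions, not VarMRI's implementation: the names `Interval`, `overlaps`, and `select_events` are hypothetical, and treating "affects a request" as "overlaps it in time" is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    """A half-open time interval [start, end), in arbitrary time units."""
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share any point in time."""
    return a.start < b.end and b.start < a.end


def select_events(events: List[Interval], requests: List[Interval]) -> List[Interval]:
    """Keep only events that overlap at least one recorded request.

    Assumption: time overlap stands in for "the event affects the request".
    """
    return [e for e in events if any(overlaps(e, r) for r in requests)]


# Hypothetical data: two requests and three kernel/hardware events.
requests = [Interval(0.0, 5.0), Interval(10.0, 12.0)]
events = [Interval(1.0, 2.0), Interval(6.0, 7.0), Interval(11.5, 13.0)]

kept = select_events(events, requests)
```

Here the event at 6.0 to 7.0 falls between the two requests and is dropped, which is the kind of data reduction the principle aims for. The quadratic scan is chosen for clarity; a sorted sweep would be a natural choice for large traces.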