CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

Over the last decade, stream processing has seen broad adoption in both commercial and research settings. One key element of this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the time of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints, an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind the prevalence of the coordinated approach remain anecdotal, as reported by practitioners in the stream processing community. At the same time, other common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored. This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted to the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive with the coordinated one under uniformly distributed workloads, but also outperforms it under skewed workloads. We conclude that, rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.
