High-level Stream Processing: A Complementary Analysis of Fault Recovery

Parallel computing is essential for accelerating software systems. Because processing high data volumes continuously is a recurring challenge, stream processing has emerged as both a paradigm and a software architectural style. Many software systems rely on stream processing to deliver scalable performance, while open-source frameworks provide coding abstractions and high-level parallel computing. Although the performance of stream processing has been studied extensively, fault tolerance (a key abstraction offered by stream processing frameworks) has not yet been adequately measured with comprehensive testbeds. In this work, we extend previous fault recovery measurements with an exploratory analysis of the configuration space, additional experimental measurements, and an analysis of improvement opportunities. We focus on robust deployment setups inspired by the requirements for near real-time analytics of a large cloud observability platform. The results indicate significant potential for improving fault recovery and performance. However, realizing these improvements means grappling with configuration complexity, particularly in identifying which configurations to fine-tune and determining appropriate values for them. New abstractions for transparent configuration tuning are therefore also needed for large-scale industry setups. We believe that more software engineering effort is needed to provide insights into potential abstractions and how to achieve them. The stream processing community and industry practitioners could also benefit from more interaction with the high-level parallel programming community, whose expertise and insights on making parallel programming more productive and efficient could be extended to stream processing.
