Recent advances in data stream processing frameworks have improved real-time data handling; however, scalability remains a significant challenge that affects both throughput and latency. While studies have explored this issue on local machines and cloud clusters, research on modern high-performance computing (HPC) infrastructures remains limited because scalable measurement tools are lacking. This work presents SProBench, a novel benchmark suite designed to evaluate the performance of data stream processing frameworks on large-scale computing systems. Building on best practices, SProBench has a modular architecture, natively supports SLURM-based clusters, and integrates with popular stream processing frameworks such as Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. Experiments conducted on HPC clusters demonstrate its scalability: it delivers throughput more than ten times that of existing benchmarks. Its distinguishing features, including complete customization options, built-in automated experiment management tools, interoperability with these frameworks, and an open-source license, make SProBench a benchmark suite tailored to the needs of modern data stream processing frameworks.