Efficient data streaming is essential for real-time data analytics, visualization, and machine learning model training, particularly when dealing with high-volume datasets. Various streaming technologies and serialization protocols have been developed to cater to different streaming requirements, and each performs differently depending on the specific tasks and datasets involved. This variety poses challenges in selecting the most appropriate combination, as encountered during the implementation of a streaming system for MAST fusion device data or the SKA's radio astronomy data. To address this challenge, we conducted an empirical study of widely used data streaming technologies and serialization protocols. We also developed an extensible, open-source software framework to benchmark their efficiency across various performance metrics. Our study uncovers significant performance differences and trade-offs between these technologies, providing insights that can guide the selection of optimal streaming and serialization solutions for modern data-intensive applications. Our goal is to equip the scientific community and industry professionals with the knowledge needed to enhance data streaming efficiency for improved data utilization and real-time analysis.
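To illustrate the kind of measurement such a benchmarking framework performs, the following minimal sketch times round-trip serialization for two common Python protocols and records encoded payload size. It is a hypothetical example, not the paper's actual framework: the `benchmark` helper, the metric names, and the sample payload are all assumptions made for illustration.

```python
import json
import pickle
import time

def benchmark(name, dumps, loads, payload, iters=200):
    """Time serialize + deserialize round trips and report encoded size."""
    blob = dumps(payload)
    start = time.perf_counter()
    for _ in range(iters):
        loads(dumps(payload))
    elapsed = time.perf_counter() - start
    return {
        "protocol": name,
        "bytes": len(blob),                          # encoded payload size
        "us_per_roundtrip": 1e6 * elapsed / iters,   # mean latency per round trip
    }

# Hypothetical sensor-like payload; real benchmarks would sweep sizes and shapes.
payload = {"channel": "magnetics", "samples": list(range(1000))}

results = [
    benchmark("json", lambda o: json.dumps(o).encode(), lambda b: json.loads(b), payload),
    benchmark("pickle", pickle.dumps, pickle.loads, payload),
]
for r in results:
    print(r)
```

A full study would extend this pattern with additional protocols (e.g. Protocol Buffers, Apache Arrow) and transport layers, and report throughput and latency distributions rather than a single mean.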