Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generated by sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. Distributed stream processing (DSP) systems address this challenge, offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often, however, data streams must be enriched with additional information before they can be processed correctly, which introduces additional dependencies and potential bottlenecks. In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we find that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, the resulting increase in resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.