Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generated by sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. Distributed stream processing (DSP) systems address this challenge by offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often, however, data streams must be enriched with additional information before they can be processed correctly, which introduces additional dependencies and potential bottlenecks. In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system in a realistic cloud environment, we find that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, the accompanying increase in resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.