Stream processing is used extensively across the IoT-to-Cloud spectrum to distill information from continuous streams of data. Streaming applications usually run in dedicated Stream Processing Engines (SPEs) that adopt the DataFlow model, which defines such applications as graphs of operators that, step by step, transform data into the desired results. Because operators can be deployed and executed independently, the DataFlow model supports parallelism and distribution, which makes streaming applications scalable. Today there is an abundance of SPEs, each with its own set of operators. In this context, understanding how operators' semantics overlap within and across SPEs, and thus which SPEs can support a given application, is not trivial. We tackle this problem by formally showing that common SPE operators can be expressed as compositions of a single, minimalistic Aggregate operator; consequently, any framework able to run compositions of this operator can run applications defined for state-of-the-art SPEs. The Aggregate operator relies only on core concepts of the DataFlow model, such as data partitioning by key and time-based windows, and outputs at most one value for each window it analyzes. Alongside our formal argument, we empirically assess how an SPE that relies only on this operator compares with an SPE offering operator-specific implementations, and we study the performance impact of a more expressive Aggregate operator obtained by relaxing the constraint of at most one output value per window. The existence of such a common denominator not only implies the portability of operators within and across SPEs but also defines a concise set of requirements that other data processing frameworks must meet to support streaming applications.
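To make the idea of a minimalistic Aggregate concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the operator definition or the constructions used in the paper: the names `Event`, `aggregate`, `filter_via_aggregate`, and `map_via_aggregate` are hypothetical, windows are tumbling over integer timestamps, the input is a finite in-memory list rather than an unbounded stream (so there are no watermarks or late arrivals), and each output carries the start of its window as its timestamp. The Filter and Map variants further assume hashable values and no two events with the same timestamp and value.

```python
from collections import defaultdict
from typing import Any, Callable, Hashable, Iterable, List, NamedTuple, Optional


class Event(NamedTuple):
    ts: int      # integer event timestamp (assumption: discrete time units)
    value: Any   # payload


def aggregate(
    stream: Iterable[Event],
    window_size: int,
    key_by: Callable[[Event], Hashable],
    f: Callable[[Hashable, List[Event]], Optional[Any]],
) -> List[Event]:
    """Group events by (tumbling window, key) and call f once per group.

    f returns either a value (emitted as one output) or None (no output),
    so each window instance yields at most one output.
    """
    windows = defaultdict(list)
    for e in stream:
        start = (e.ts // window_size) * window_size
        windows[(start, key_by(e))].append(e)

    out = []
    # Stable sort on window start only, so keys need not be comparable.
    for (start, key), events in sorted(windows.items(), key=lambda kv: kv[0][0]):
        result = f(key, events)
        if result is not None:
            out.append(Event(start, result))
    return out


def filter_via_aggregate(stream, predicate):
    # Window of one time unit, keyed by the value itself: each window
    # instance holds a single event, which is forwarded or dropped.
    return aggregate(stream, 1, key_by=lambda e: e.value,
                     f=lambda key, events: key if predicate(key) else None)


def map_via_aggregate(stream, fn):
    # Same windowing as above; the single event is transformed by fn.
    # Note: a result of None would be dropped in this sketch.
    return aggregate(stream, 1, key_by=lambda e: e.value,
                     f=lambda key, events: fn(key))


stream = [Event(0, 3), Event(0, 4), Event(1, 5), Event(2, 10)]

filtered = filter_via_aggregate(stream, lambda v: v > 4)
doubled = map_via_aggregate(stream, lambda v: v * 2)
# A "regular" aggregation: per-parity sums over windows of two time units.
sums = aggregate(stream, 2, key_by=lambda e: e.value % 2,
                 f=lambda key, events: sum(e.value for e in events))
```

In this sketch, stateless operators such as Filter and Map fall out of the same `aggregate` function simply by shrinking the window to one time unit and keying by the value, which mirrors, in a simplified way, the claim that common operators can be composed from a single Aggregate.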