Stream processing engines typically process data tuple by tuple or in micro-batches. Many applications, however, only require that the tuples arriving within a predefined window be processed by a given deadline. Running such queries on a stream processing engine can be very inefficient, because each tuple or micro-batch often incurs significant overhead. Since the result is needed only at the deadline, tuples can instead be processed in larger batches, which exploits the wider time window available for computation and significantly reduces its cost. In this work, we present scheduling schemes that minimize the overhead cost while meeting query deadline constraints, for both single-query and multi-query scenarios. The proposed scheduling algorithms have been implemented as a Custom Query Scheduler on top of Apache Spark. Our performance study with TPC-H data, in both single-query and multi-query modes, shows orders-of-magnitude improvement over naively using Spark Streaming.
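The trade-off between per-batch overhead and the deadline can be illustrated with a toy cost model. The sketch below is not the scheduling algorithm of this work; it assumes, purely for illustration, a single worker, tuples arriving at a constant rate over the window, equal-sized batches, and a batch cost of a fixed overhead plus a per-tuple cost. All function names and parameter values are hypothetical.

```python
import math


def total_cost(num_tuples, batch_size, overhead, per_tuple):
    """Total processing time under the assumed linear cost model:
    one fixed overhead per batch plus a constant cost per tuple."""
    batches = math.ceil(num_tuples / batch_size)
    return batches * overhead + num_tuples * per_tuple


def finish_time(num_tuples, window, k, overhead, per_tuple):
    """Time at which the last of k equal-sized batches completes.

    Assumption: tuples arrive uniformly over [0, window], so batch i
    is complete (ready) at window * i / k. A single worker processes
    batches in order and cannot start a batch before it is ready.
    """
    per_batch = overhead + per_tuple * num_tuples / k
    t = 0.0
    for i in range(1, k + 1):
        ready = window * i / k
        t = max(t, ready) + per_batch
    return t


def fewest_batches(num_tuples, window, deadline, overhead, per_tuple,
                   max_batches=1000):
    """Smallest number of batches whose last batch finishes by the
    deadline. Fewer batches means less total overhead, so the first
    feasible k also has the lowest cost in this model. Returns None
    if no k up to max_batches is feasible."""
    for k in range(1, max_batches + 1):
        if finish_time(num_tuples, window, k, overhead, per_tuple) <= deadline:
            return k
    return None


# Hypothetical workload, all times in milliseconds:
# 10,000 tuples over a 60 s window, result due 10 s after the window closes,
# 2 s fixed overhead per batch, 1 ms per tuple.
k = fewest_batches(10_000, 60_000, 70_000, overhead=2_000, per_tuple=1)
large = total_cost(10_000, 10_000 // k, overhead=2_000, per_tuple=1)
micro = total_cost(10_000, 100, overhead=2_000, per_tuple=1)
print(k, large, micro)
```

With these assumed numbers, a single batch would miss the deadline, two batches meet it, and micro-batches of 100 tuples spend most of their time on overhead. A real scheduler would also have to handle non-uniform arrivals, cost estimation, and several queries competing for the same resources.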