Many applications process a stream of tuples over a window duration and require the results within a specified deadline after the end of the window. In such scenarios, processing tuples intermittently (in batches), instead of eagerly as they arrive, significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples in an elastic parallel environment, where nodes can be scaled up or down. Our scheduling schemes ensure that deadlines are met while incurring minimum cost. They also handle multiple concurrent queries, the arrival of new queries, and variations in input rate. We have implemented our schemes on top of Apache Spark in the AWS EMR environment and evaluated their performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives such as using a fixed set of nodes without elasticity or using Spark Streaming.