Many applications process a stream of tuples over a window duration and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches), rather than eagerly as they arrive, significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed (non-elastic) environments. In this paper, we propose scheduling schemes for batched processing of tuples in an elastic parallel environment, where the number of nodes can be scaled up or down. Our scheduling schemes ensure that deadlines are met while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark in the AWS EMR environment and evaluated their performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives such as using a fixed set of nodes without elasticity or using Spark Streaming.