Sliding-window aggregation is a foundational stream processing primitive that efficiently summarizes recent data. The state-of-the-art algorithms for sliding-window aggregation are highly efficient when stream data items are evicted or inserted one at a time, even when some of the insertions occur out-of-order. However, real-world streams are often not only out-of-order but also bursty, causing data items to be evicted or inserted in larger bulks. This paper introduces a new algorithm for sliding-window aggregation with bulk eviction and bulk insertion. For the special case of single insert and evict, our algorithm matches the theoretical complexity of the best previous out-of-order algorithms. For the case of bulk evict, our algorithm improves upon the theoretical complexity of the best previous algorithm for that case and also outperforms it in practice. For the case of bulk insert, there are no prior algorithms, and our algorithm improves upon the naive approach of emulating bulk insert with a loop over single inserts, both in theory and in practice. Overall, this paper makes high-performance algorithms for sliding-window aggregation more broadly applicable by efficiently handling the ubiquitous cases of out-of-order data and bursts.
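To make the naive baseline concrete, the following is a minimal sketch of a timestamp-ordered window in which bulk insert is emulated with a loop over single inserts. It is not the algorithm introduced in this paper. The class name `NaiveWindow`, its method names, and the use of a sorted Python list are all assumptions made for illustration only.

```python
import bisect
import operator
from functools import reduce


class NaiveWindow:
    """Hypothetical baseline window; not the paper's algorithm."""

    def __init__(self, combine=operator.add, identity=0):
        self.combine = combine    # associative aggregation operator (assumed)
        self.identity = identity  # identity element of that operator
        self.times = []           # timestamps, kept sorted
        self.values = []          # values aligned with self.times

    def insert(self, t, v):
        # Single insert, possibly out-of-order: find the slot by timestamp.
        i = bisect.bisect_right(self.times, t)
        self.times.insert(i, t)
        self.values.insert(i, v)

    def bulk_insert(self, items):
        # Naive emulation of bulk insert: one single insert per item.
        for t, v in items:
            self.insert(t, v)

    def bulk_evict(self, t):
        # Remove every item whose timestamp is <= t.
        i = bisect.bisect_right(self.times, t)
        del self.times[:i]
        del self.values[:i]

    def query(self):
        # Recompute the aggregate over the whole window from scratch.
        return reduce(self.combine, self.values, self.identity)


w = NaiveWindow()
w.bulk_insert([(1, 10), (4, 40), (2, 20), (3, 30)])  # out-of-order burst
total_before = w.query()
w.bulk_evict(2)                                      # drop timestamps 1 and 2
total_after = w.query()
```

In this sketch each single insert shifts list elements and each query rescans the window, so a burst of m items into a window of n items costs on the order of m·n steps. That per-item overhead is the kind of cost a dedicated bulk operation aims to avoid.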