As critical systems generate data at ever-increasing rates, estimating functions over streaming data has become essential. This demand has driven numerous advances in algorithms that efficiently query and analyze one or more data streams under memory constraints. The primary challenge is the rapid influx of new items, which requires algorithms that process streams incrementally and efficiently enough to keep up. A prominent algorithm in this domain is the AMS sketch. Originally developed to estimate the second frequency moment of a data stream, it can also estimate the cardinality of the equi-join between two relations. Two important advances have followed: the Count sketch, which significantly improves the sketch update time, and an extension of the AMS sketch that accommodates multi-join queries. However, combining the strengths of these methods to maintain sketches for multi-join queries while ensuring fast update times is a non-trivial task, and it has remained an open problem for decades, as highlighted in the existing literature. In this work, we address this problem by introducing a novel sketching method with fast updates, even for sketches capable of accurately estimating the cardinality of complex multi-join queries. We prove that our estimator is unbiased and has the same error guarantees as the AMS-based method. Our experimental results confirm the significant improvement in update time complexity, yielding estimates that are orders of magnitude faster, with equal or better estimation accuracy.
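To make the update-time issue concrete, the following is a minimal sketch of a textbook-style AMS sketch used to estimate the cardinality of an equi-join between two streams. It is not the method introduced in this work. The class name `AMSSketch`, the function `estimate_join_size`, the parameters `rows`, `cols`, and `seed`, and the polynomial sign hash are all illustrative assumptions; keys are assumed to be non-negative integers, and both sketches must be built with the same seed so that they share hash functions.

```python
import random
from statistics import median

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash field size (assumption)


class AMSSketch:
    """Toy AMS sketch: rows * cols counters, each with its own +/-1 hash."""

    def __init__(self, rows, cols, seed):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        # One random degree-3 polynomial per counter, a common way to get
        # (approximately) 4-wise independent signs.
        self.coeffs = [[rng.randrange(1, PRIME) for _ in range(4)]
                       for _ in range(rows * cols)]
        self.counters = [0] * (rows * cols)

    def _sign(self, idx, key):
        a, b, c, d = self.coeffs[idx]
        h = (((a * key + b) * key + c) * key + d) % PRIME
        return 1 if h & 1 else -1

    def update(self, key, weight=1):
        # Every counter is touched on every update: O(rows * cols) per item.
        for idx in range(len(self.counters)):
            self.counters[idx] += weight * self._sign(idx, key)


def estimate_join_size(s1, s2):
    """Median over rows of the mean over columns of counter products."""
    means = []
    for r in range(s1.rows):
        row = range(r * s1.cols, (r + 1) * s1.cols)
        means.append(sum(s1.counters[i] * s2.counters[i] for i in row) / s1.cols)
    return median(means)
```

Each call to `update` here touches every counter, which is the cost that a Count-sketch-style layout reduces: there, an item would be hashed to a single bucket per row, so only `rows` counters change per update.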