A key need in many disciplines is to perform analytics over fast-paced data streams, similar in nature to traditional OLAP analytics in relational databases, i.e., with filters and aggregates. Storing unbounded streams, however, is neither realistic nor desirable, due to the high storage requirements and the delays introduced when storing massive data. Accordingly, many synopses (sketches) have been proposed that summarize the stream in small memory (usually small enough to fit in RAM), so that aggregate queries can be approximated efficiently without storing the full stream. However, past synopses predominantly focus on summarizing single-attribute streams and cannot efficiently handle filters and constraints on arbitrary subsets of multiple attributes. In this work, we propose OmniSketch, the first sketch that scales to fast-paced and complex data streams (with many attributes) and supports aggregates with filters on multiple attributes, dynamically chosen at query time. The sketch offers probabilistic guarantees, a favorable space-accuracy tradeoff, and worst-case logarithmic complexity for updates and for query execution. We demonstrate experimentally, with both real and synthetic data, that the sketch outperforms the state of the art and that it approximates complex ad-hoc queries within the configured accuracy guarantees, with small memory requirements.
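For readers unfamiliar with the single-attribute synopses the abstract refers to, the following is a minimal sketch of a classic Count-Min sketch. It is not OmniSketch and makes no claim about how OmniSketch is implemented. The class name, the `width` and `depth` defaults, and the BLAKE2b-based hashing are all illustrative assumptions. It shows how a fixed-size table can approximate per-key counts over a stream, and why a summary keyed on one attribute cannot by itself answer a filter on a second attribute chosen at query time.

```python
import hashlib


class CountMinSketch:
    """Classic Count-Min sketch over a single attribute (illustrative only)."""

    def __init__(self, width=272, depth=5):
        # width and depth are hypothetical defaults; in practice they are
        # derived from the desired error bound and failure probability.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, obtained by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            repr(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Update one counter per row; cost is O(depth) regardless of stream size.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate counters, so the minimum over rows is an
        # upper-biased estimate: it never undercounts for non-negative updates.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


if __name__ == "__main__":
    # Hypothetical stream of (country, device) records.
    stream = [("NL", "mobile"), ("US", "desktop"), ("NL", "desktop"), ("NL", "mobile")]
    by_country = CountMinSketch()
    for country, _device in stream:
        by_country.add(country)
    # Approximate answer to: COUNT(*) WHERE country = 'NL'
    print(by_country.estimate("NL"))
```

Because this sketch is keyed on `country` alone, a query such as `COUNT(*) WHERE country = 'NL' AND device = 'mobile'` cannot be answered from it. Maintaining one such sketch per attribute combination grows exponentially with the number of attributes, which is the gap the abstract says OmniSketch targets.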