Exponential growth in data collection is creating significant challenges for data storage and analytics latency. Approximate Query Processing (AQP) has long been touted as a solution for accelerating analytics on large datasets; however, there is still room for improvement across all key performance criteria. In this paper, we propose a novel histogram-based data synopsis called PairwiseHist that uses recursive hypothesis testing to ensure accurate histograms and can be built on top of data compressed using Generalized Deduplication (GD). We thus show that GD data compression can contribute to AQP. Compared to state-of-the-art AQP approaches, PairwiseHist achieves better performance across all key metrics, including 2.6$\times$ higher accuracy, 3.5$\times$ lower latency, 24$\times$ smaller synopses, and 1.5--4$\times$ faster construction.