Range aggregate queries (RAQs) are an integral part of many real-world applications, which often need fast, approximate answers. Recent work answers RAQs with machine learning (ML) models that learn a model of the data. However, there is no theoretical understanding of why and when these ML-based approaches perform well. Furthermore, because they model the data, they cannot exploit query-specific information to improve performance in practice. In this paper, we model "queries" rather than data and train neural networks to learn the query answers. This change of focus lets us study our ML approach theoretically and provide a distribution- and query-dependent error bound for neural networks answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework for answering RAQs in practice. An extensive experimental study on real-world, TPC-benchmark, and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than the state of the art, and with better accuracy.
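To make the idea of modeling queries rather than data concrete, the following is a minimal sketch under stated assumptions. It is not the NeuroSketch implementation. It assumes a one-dimensional dataset, a normalized range COUNT query given as a pair (lo, hi), and a small hand-written network trained by plain SGD. All names (`exact_count`, `make_training_set`, `TinyMLP`) are hypothetical and chosen only for illustration.

```python
import math
import random

def exact_count(data, lo, hi):
    """Ground-truth range COUNT, normalized: fraction of points in [lo, hi]."""
    return sum(lo <= x <= hi for x in data) / len(data)

def make_training_set(data, n_queries, rng):
    """Sample random range queries and label each with its exact answer."""
    samples = []
    for _ in range(n_queries):
        lo, hi = sorted((rng.random(), rng.random()))
        samples.append(((lo, hi), exact_count(data, lo, hi)))
    return samples

class TinyMLP:
    """Hypothetical stand-in model: one tanh hidden layer, squared-error SGD."""

    def __init__(self, hidden, rng):
        self.w1 = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.b2 = 0.0

    def _hidden(self, q):
        return [math.tanh(w[0] * q[0] + w[1] * q[1] + b)
                for w, b in zip(self.w1, self.b1)]

    def predict(self, q):
        h = self._hidden(q)
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2

    def sgd_step(self, q, y, lr):
        h = self._hidden(q)
        err = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2 - y
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the pre-update output weight.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.w1[j][0] -= lr * grad_h * q[0]
            self.w1[j][1] -= lr * grad_h * q[1]
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err

def mean_abs_error(model, samples):
    return sum(abs(model.predict(q) - y) for q, y in samples) / len(samples)

rng = random.Random(0)
data = [rng.betavariate(2, 5) for _ in range(2000)]  # assumed synthetic data
train = make_training_set(data, 500, rng)
held_out = make_training_set(data, 200, rng)

model = TinyMLP(hidden=8, rng=rng)
error_before = mean_abs_error(model, held_out)
for _ in range(30):
    rng.shuffle(train)
    for q, y in train:
        model.sgd_step(q, y, lr=0.02)
error_after = mean_abs_error(model, held_out)

print(f"held-out mean absolute error: {error_before:.3f} -> {error_after:.3f}")
```

After training, answering a query is a single forward pass through the network and does not touch the data, which is where the speed advantage of a query model would come from in this sketch.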