We investigate how to efficiently compute the difference of two (or more) conjunctive queries, difference being the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the result of every input query as a separate set and then compute the difference of those sets. This approach is bottlenecked by the cost of evaluating every input query individually, which can be very expensive, particularly when the difference contains only a few results. In this paper, we introduce a new approach that exploits the structural properties of the input queries and rewrites the original query by pushing the difference operator down as far as possible. We show that for a large class of difference queries, this approach leads to a linear-time algorithm in terms of the input size and the (final) output size, i.e., the number of query results that survive the difference operator. We complete this result by showing the hardness of computing the remaining difference queries in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide heuristics that provably improve on the standard approach. Finally, we compare our approach with standard SQL engines on graph and benchmark datasets. The experimental results demonstrate order-of-magnitude speedups of our approach over vanilla SQL.
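
To make the idea of pushing the difference operator down concrete, the following is a minimal sketch on a toy case, not the algorithm of the paper. It assumes two hypothetical queries Q1 = R(a,b) ⋈ S(b,c) and Q2 = R(a,b) ⋈ T(b,c) that share the relation R and have the same output attributes, so that Q1 − Q2 = R ⋈ (S − T). The relation names, the helper functions, and the data are all illustrative assumptions.

```python
from collections import defaultdict


def join(left, right):
    """Natural join of left(a, b) with right(b, c) on b; returns (a, b, c) triples."""
    by_b = defaultdict(list)
    for b, c in right:
        by_b[b].append(c)
    return {(a, b, c) for a, b in left for c in by_b.get(b, ())}


def difference_naive(R, S, T):
    # Standard approach: materialize both query results, then subtract the sets.
    return join(R, S) - join(R, T)


def difference_pushed(R, S, T):
    # Rewritten query: subtract the base relations first, then join once.
    # Valid here because both queries share R(a, b) and output (a, b, c).
    return join(R, S - T)


# Hypothetical data: every tuple joins on b = 0, and S and T differ in one tuple.
R = {(i, 0) for i in range(1000)}
S = {(0, j) for j in range(100)}
T = {(0, j) for j in range(1, 100)}

result = difference_pushed(R, S, T)
assert result == difference_naive(R, S, T)
```

On this data the naive version materializes 100,000 and 99,000 intermediate tuples to produce 1,000 results, whereas the rewritten version only touches the inputs and the final output. This rewriting is only valid for the specific query shape assumed above.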