We investigate how to efficiently compute the difference of two (or more) conjunctive queries, the last operator in relational algebra whose complexity remains to be unraveled. The standard approach in practical database systems is to materialize the result of every input query as a separate set and then compute the difference of these sets. This approach is bottlenecked by the cost of evaluating every input query individually, which can be very expensive, particularly when the difference contains only a few results. In this paper, we introduce a new approach that exploits the structural properties of the input queries and rewrites the original query by pushing the difference operator down as far as possible. We show that for a large class of difference queries, this approach leads to a linear-time algorithm in terms of the input size and the (final) output size, i.e., the number of query results that survive the difference operator. We complement this result by showing that the remaining difference queries are hard to compute in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide heuristics that provably improve on the standard approach. Finally, we compare our approach with standard SQL engines on graph and benchmark datasets. The experimental results demonstrate order-of-magnitude speedups of our approach over vanilla SQL.
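To make the contrast between the two strategies concrete, the following is a minimal Python sketch on a toy case; it is not the algorithm of the paper. It assumes set semantics and two full join queries with identical shape, Q1 = R1(a, b) ⋈ S1(b, c) and Q2 = R2(a, b) ⋈ S2(b, c). The relation names and the helper functions `join`, `difference_baseline`, and `difference_pushed_down` are hypothetical. For this shape, a tuple of Q1 is missing from Q2 exactly when its (a, b) part is not in R2 or its (b, c) part is not in S2, so Q1 − Q2 = ((R1 − R2) ⋈ S1) ∪ (R1 ⋈ (S1 − S2)).

```python
def join(r, s):
    """Natural join of r(a, b) and s(b, c) on b, under set semantics."""
    by_b = {}
    for b, c in s:
        by_b.setdefault(b, []).append(c)
    return {(a, b, c) for a, b in r for c in by_b.get(b, ())}


def difference_baseline(r1, s1, r2, s2):
    """Standard approach: materialize both query results, then subtract."""
    return join(r1, s1) - join(r2, s2)


def difference_pushed_down(r1, s1, r2, s2):
    """Illustrative rewriting (an assumption of this sketch): apply the
    difference to the base relations first, then join."""
    return join(r1 - r2, s1) | join(r1, s1 - s2)


# Hypothetical toy data: Q1 and Q2 share all but one result.
r1 = {(1, 10), (2, 20)}
s1 = {(10, "x"), (20, "y")}
r2 = {(1, 10), (2, 20)}
s2 = {(10, "x")}

print(difference_baseline(r1, s1, r2, s2))
print(difference_pushed_down(r1, s1, r2, s2))
```

In this toy case, every tuple produced by the pushed-down version belongs to the final answer, whereas the baseline first builds both full join results regardless of how few tuples survive. Queries with projections or differing structure need more care than this identity provides.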