Improving the performance of join operations in data systems has long been an issue of great importance. More recently, much attention has been devoted to multi-way join performance, and especially to reducing the cost of producing intermediate tuples that never make it into the final result. We contribute a new multi-way join algorithm, coined SieveJoin, which extends the well-known Bloomjoin algorithm to multi-way joins and achieves state-of-the-art join query execution efficiency. SieveJoin's salient novel feature is that it allows Bloom filters to be propagated along the join path, enabling the system to 'stop early' and eliminate useless intermediate join results. The key design objective of SieveJoin is to efficiently 'learn' the join results, based on Bloom filters, with negligible memory overhead. We discuss the bottlenecks that delay multi-way joins and how Bloom filters are used to avoid generating unnecessary intermediate join results. We provide a detailed experimental evaluation on various datasets, against a state-of-the-art column-store database and a worst-case optimal multi-way join algorithm, showcasing SieveJoin's gains in response time.
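To make the idea of propagating Bloom filters along a join path concrete, the following is a minimal sketch, not the SieveJoin algorithm itself. It assumes a three-relation chain join R(a, b) ⋈ S(b, c) ⋈ T(c, d). The `BloomFilter` class, the `chain_join` function, and the filter sizes are hypothetical names and parameters chosen for illustration. A filter built on T.c prunes S, and a filter built on the surviving S.b values then prunes R, so tuples that cannot reach the final result are dropped before any join is materialized.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (assumed sizes, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(key))


def chain_join(R, S, T):
    """Join R(a, b), S(b, c), T(c, d) after Bloom-filter pruning.

    Returns (result, number of R tuples kept, number of S tuples kept).
    """
    # Step 1: summarize T's join column and use it to prune S.
    bf_c = BloomFilter()
    for c, _ in T:
        bf_c.add(c)
    S_kept = [(b, c) for b, c in S if c in bf_c]

    # Step 2: propagate along the path: summarize the surviving S.b
    # values and use that filter to prune R.
    bf_b = BloomFilter()
    for b, _ in S_kept:
        bf_b.add(b)
    R_kept = [(a, b) for a, b in R if b in bf_b]

    # Step 3: exact hash joins on the reduced inputs; these also
    # discard any Bloom-filter false positives.
    T_by_c = {}
    for c, d in T:
        T_by_c.setdefault(c, []).append(d)
    S_by_b = {}
    for b, c in S_kept:
        if c in T_by_c:
            S_by_b.setdefault(b, []).append(c)

    result = []
    for a, b in R_kept:
        for c in S_by_b.get(b, []):
            for d in T_by_c[c]:
                result.append((a, b, c, d))
    return result, len(R_kept), len(S_kept)
```

Because Bloom filters admit false positives but no false negatives, the pruning step can only keep too many tuples, never too few; the exact hash joins in the last step guarantee a correct result regardless of the filter parameters.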