Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we propose SplitJoin, a framework that introduces split as a first-class query operator. By partitioning input tables into heavy and light parts, SplitJoin allows different data partitions to use distinct query plans, with the goal of reducing intermediate sizes using existing binary join engines. We systematically explore the design space for split-based optimizations, including threshold selection, split strategies, and join ordering after splits. Implemented as a front-end to DuckDB and Umbra, SplitJoin achieves substantial improvements: on DuckDB, SplitJoin completes 43 social network queries (vs. 29 natively), achieving 2.1x faster runtime and 7.9x smaller intermediates on average (up to 13.6x and 74x, respectively); on Umbra, it completes 45 queries (vs. 35), achieving 1.3x speedups and 1.2x smaller intermediates on average (up to 6.1x and 2.1x, respectively).
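To make the heavy/light idea concrete, the following is a minimal sketch of a split on a toy triangle query Q(a, b, c) = R(a, b), S(b, c), T(c, a), not SplitJoin's actual implementation. The function names (`split`, `hash_join`, `triangles_rs_first`, `triangles_rt_first`), the degree-based threshold rule, and the toy graph are all assumptions made for illustration. Rows of R whose join key `b` is frequent (heavy) use one binary join order, the remaining (light) rows use another, and the two results are unioned.

```python
from collections import Counter, defaultdict

def split(rows, key_index, threshold):
    """Hypothetical split operator: partition rows into (heavy, light)
    by how often each row's key value occurs in the input."""
    degree = Counter(row[key_index] for row in rows)
    heavy = [r for r in rows if degree[r[key_index]] > threshold]
    light = [r for r in rows if degree[r[key_index]] <= threshold]
    return heavy, light

def hash_join(left, right, left_col, right_col):
    """Plain binary hash join; each output row is left row + right row."""
    index = defaultdict(list)
    for r in right:
        index[r[right_col]].append(r)
    return [l + r for l in left for r in index.get(l[left_col], [])]

def triangles_rs_first(R, S, T):
    """Plan 1: (R join S on b), then filter against T(c, a).
    Returns the result set and the intermediate size."""
    rs = hash_join(R, S, 1, 0)                      # rows (a, b, b, c)
    t_set = set(T)
    out = {(a, b, c) for (a, b, _, c) in rs if (c, a) in t_set}
    return out, len(rs)

def triangles_rt_first(R, S, T):
    """Plan 2: (R join T on a), then filter against S(b, c).
    Returns the result set and the intermediate size."""
    rt = hash_join(R, T, 0, 1)                      # rows (a, b, c, a)
    s_set = set(S)
    out = {(a, b, c) for (a, b, c, _) in rt if (b, c) in s_set}
    return out, len(rt)

# Toy skewed graph (assumed data): vertex 0 is a hub, plus one extra edge.
n = 50
E = [(i, 0) for i in range(1, n + 1)] \
  + [(0, i) for i in range(1, n + 1)] \
  + [(1, 2)]

# Baseline: one plan for the whole input.
unsplit_result, unsplit_inter = triangles_rs_first(E, E, E)

# Split R on the frequency of b, then give each part its own join order.
heavy, light = split(E, key_index=1, threshold=2)
heavy_result, heavy_inter = triangles_rt_first(heavy, E, E)
light_result, light_inter = triangles_rs_first(light, E, E)

split_result = heavy_result | light_result
split_inter = heavy_inter + light_inter

print("same answer:", split_result == unsplit_result)
print("intermediate rows, single plan:", unsplit_inter)
print("intermediate rows, split plans:", split_inter)
```

On this toy input both strategies return the same three triangles, but the single plan materializes 2552 intermediate rows while the two split plans together materialize 103. The threshold of 2 is hand-picked for the example; choosing thresholds, split attributes, and per-partition join orders is the design space the abstract refers to.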