Storage-based joins are still commonly used today because the memory budget does not always scale with the data size. Among the many join algorithms developed, the Hybrid Hash Join (HHJ) has been widely deployed and proven efficient; it is designed to exploit any available memory to maximize the data that is joined directly in memory. However, HHJ cannot fully exploit detailed knowledge of the correlation distribution of the join attributes. In this paper, we show that, given a correlation skew in the join attributes, HHJ partitions data in a suboptimal way. To do that, we derive the optimal partitioning using a new cost-based analysis of partitioning-based joins that is tailored for primary key-foreign key (PK-FK) joins, one of the most common join types. This optimal partitioning strategy has a high memory cost; thus, we further derive an approximate algorithm that has a tunable memory cost and leads to near-optimal results. Our algorithm, termed NOCAP (Near-Optimal Correlation-Aware Partitioning) join, outperforms the state-of-the-art for skewed correlations by up to $30\%$, and the textbook Grace Hash Join by up to $4\times$. Further, for a limited memory budget, NOCAP outperforms HHJ by up to $10\%$, even for uniform correlation. Overall, NOCAP dominates state-of-the-art algorithms and mimics the best algorithm for a memory budget varying from below $\sqrt{\|\text{relation}\|}$ to more than $\|\text{relation}\|$.
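The intuition behind correlation-aware partitioning can be illustrated with a toy sketch. This is not the NOCAP algorithm or its cost model; it only assumes a PK-FK join in which a fixed number of primary-key records fit in memory, and it counts how many foreign-key rows can be joined without spilling to storage. All function names, the `budget` parameter, and the data are hypothetical.

```python
from collections import Counter


def in_memory_matches(fk_values, cached_keys):
    """Count foreign-key rows whose matching primary key is held in memory."""
    cached = set(cached_keys)
    return sum(1 for value in fk_values if value in cached)


def pick_arbitrary_keys(primary_keys, budget):
    """Correlation-agnostic choice: keep the first `budget` keys in key order."""
    return sorted(primary_keys)[:budget]


def pick_hottest_keys(primary_keys, fk_values, budget):
    """Correlation-aware choice: keep the keys with the most matching rows."""
    counts = Counter(fk_values)
    # Sort by descending match count, breaking ties by key for determinism.
    return sorted(primary_keys, key=lambda k: (-counts[k], k))[:budget]


# Hypothetical skewed correlation: key 9 matches 50 rows, key 8 matches 30,
# and each of the keys 0..7 matches a single row (88 rows in total).
primary_keys = list(range(10))
fk_values = [9] * 50 + [8] * 30 + list(range(8))

budget = 2  # assumed number of primary-key records that fit in memory
agnostic = in_memory_matches(fk_values, pick_arbitrary_keys(primary_keys, budget))
aware = in_memory_matches(
    fk_values, pick_hottest_keys(primary_keys, fk_values, budget)
)
print(agnostic, aware)
```

With the same memory budget, the correlation-aware choice joins far more rows directly in memory in this toy setting, which is the kind of gap a skew-aware partitioning aims to exploit.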