Significant research effort has been devoted to improving the performance of join processing in the massively parallel computation model, where the goal is to evaluate a query with the minimum possible data transfer between machines. However, determining the best possible parallel algorithm for an arbitrary join query remains an open problem. In this paper, we present an algorithm that takes a step forward in this endeavour. Our new algorithm is simple and builds on two existing ideas: data partitioning and the HyperCube primitive. The novelty of our approach lies in a careful choice of the HyperCube shares, obtained as a linear combination of multiple vertex covers. For input size $n$ and $p$ processors, the resulting load is $n/p^{1/\kappa}$, where $\kappa$ is a new hypergraph-theoretic measure we call the reduced quasi vertex-cover. The new measure matches or improves upon the guarantees of all state-of-the-art algorithms, and exhibits strong similarities to the edge quasi-packing that characterizes the worst-case optimal load of one-round algorithms.
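To make the HyperCube primitive concrete, the sketch below illustrates its classic form on the triangle query $R(a,b) \bowtie S(b,c) \bowtie T(c,a)$: processors form a $p_1 \times p_2 \times p_3$ grid, each tuple is hashed on the attributes it contains and replicated along the dimension it lacks, and each processor joins its local fragments. This shows only the standard primitive the paper builds on, not the paper's new share choice; all names (`hypercube_assign`, `local_join`, the hash `h`) are illustrative.

```python
import hashlib

def h(value, buckets, salt):
    """Deterministic hash of a value into [0, buckets)."""
    data = f"{salt}:{value}".encode()
    return int(hashlib.sha256(data).hexdigest(), 16) % buckets

def hypercube_assign(p1, p2, p3, R, S, T):
    """Distribute tuples of R(a,b) ⋈ S(b,c) ⋈ T(c,a) over a
    p1 x p2 x p3 grid of p = p1*p2*p3 processors. Each tuple is
    hashed on its own attributes and replicated along the one
    grid dimension whose attribute it does not contain."""
    cells = {}  # (i, j, k) -> relation fragments held by that processor

    def send(coord, rel, tup):
        cells.setdefault(coord, {"R": [], "S": [], "T": []})[rel].append(tup)

    for (a, b) in R:
        i, j = h(a, p1, "a"), h(b, p2, "b")
        for k in range(p3):           # replicate along the c-dimension
            send((i, j, k), "R", (a, b))
    for (b, c) in S:
        j, k = h(b, p2, "b"), h(c, p3, "c")
        for i in range(p1):           # replicate along the a-dimension
            send((i, j, k), "S", (b, c))
    for (c, a) in T:
        k, i = h(c, p3, "c"), h(a, p1, "a")
        for j in range(p2):           # replicate along the b-dimension
            send((i, j, k), "T", (c, a))
    return cells

def local_join(cell):
    """One processor joins its local fragments; the union over all
    processors is exactly the set of triangles."""
    out = set()
    for (a, b) in cell["R"]:
        for (b2, c) in cell["S"]:
            if b2 != b:
                continue
            for (c2, a2) in cell["T"]:
                if c2 == c and a2 == a:
                    out.add((a, b, c))
    return out
```

Correctness follows because every triangle $(a, b, c)$ is fully assembled at processor $(h(a), h(b), h(c))$. With uniform shares $p_1 = p_2 = p_3 = p^{1/3}$ (the choice implied by the uniform fractional vertex cover of the triangle), each relation is replicated $p^{1/3}$ times, giving the well-known $O(n/p^{2/3})$ load for this query; the paper's contribution is a more refined, vertex-cover-based choice of these shares.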