Selecting appropriate distributed join methods for the logical join operations in a query plan is crucial for the performance of data-intensive scalable computing (DISC). Different network communication patterns in the data exchange phase generate different network workloads and significantly affect distributed join performance. However, most cost-based query optimizers focus on the local computing cost and do not precisely model the network communication cost. We propose a cost model for various distributed join methods to optimize join queries on DISC platforms. Our method precisely measures the network and local computing workloads in the different execution phases, using dataset size and cardinality statistics together with the cluster's join parallelism. The cost model reveals the importance of the relative size of the joining datasets. We implement an efficient distributed join selection strategy, called RelJoin, in SparkSQL, an industry-prevalent distributed data processing framework. RelJoin uses adaptive runtime statistics for accurate cost estimation and selects the optimal distributed join method for each logical join to optimize the physical query plan. Evaluation results on the TPC-DS benchmark show that RelJoin performs best in 62 of the 97 queries and reduces the average query time by 21% compared with other strategies.
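To make the role of relative dataset size concrete, the following is a minimal sketch of a network-only cost comparison between a broadcast join and a shuffle (repartition) join. It is not the RelJoin cost model, which the abstract does not spell out; the formulas are common textbook simplifications, and the names `TableStats`, `broadcast_network_cost`, `shuffle_network_cost`, and `choose_join` are hypothetical. It assumes uniformly distributed data, one copy of the smaller table already resident on one worker, and ignores local computing cost entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for the statistics used below."""
    size_bytes: int


def broadcast_network_cost(small: TableStats, parallelism: int) -> float:
    # Assumption: the smaller table is copied to every other worker,
    # so it crosses the network (parallelism - 1) times.
    return small.size_bytes * (parallelism - 1)


def shuffle_network_cost(left: TableStats, right: TableStats,
                         parallelism: int) -> float:
    # Assumption: rows are hashed uniformly, so a fraction
    # (parallelism - 1) / parallelism of each table moves to another worker.
    moved_fraction = (parallelism - 1) / parallelism
    return (left.size_bytes + right.size_bytes) * moved_fraction


def choose_join(left: TableStats, right: TableStats, parallelism: int) -> str:
    """Pick the method with the lower estimated network cost."""
    small = min(left, right, key=lambda t: t.size_bytes)
    broadcast = broadcast_network_cost(small, parallelism)
    shuffle = shuffle_network_cost(left, right, parallelism)
    return "broadcast" if broadcast < shuffle else "shuffle"


if __name__ == "__main__":
    workers = 8
    # A small table joined with a much larger one.
    print(choose_join(TableStats(10), TableStats(1000), workers))
    # Two tables of comparable size.
    print(choose_join(TableStats(500), TableStats(1000), workers))
```

Under these assumptions, broadcasting wins only when the larger table is more than `parallelism - 1` times the size of the smaller one, which is one simple way to see why the size ratio of the inputs, rather than an absolute size threshold, can drive the choice.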