Apache Spark SQL is a cornerstone of modern big data analytics. However, optimizing Spark SQL performance is challenging because of its vast configuration space and the prohibitive cost of evaluating massive workloads. Existing tuning methods predominantly rely on full-fidelity evaluations, which are extremely time-consuming and often yield suboptimal configurations within practical budgets. Multi-fidelity optimization offers a potential solution, but directly applying standard techniques, such as data volume reduction or early stopping, proves ineffective for Spark SQL because they fail to preserve performance correlations or to represent true system bottlenecks. To address these challenges, we propose MFTune, an efficient multi-fidelity framework that introduces a query-based fidelity partitioning strategy, using representative SQL subsets as accurate, low-cost proxies for the full workload. To navigate the large search space, MFTune incorporates a density-based optimization mechanism for automated knob and range compression, alongside an adapted transfer learning approach and a two-phase warm start to further accelerate the tuning process. Experimental results on the TPC-H and TPC-DS benchmarks demonstrate that MFTune significantly outperforms five state-of-the-art tuning methods, identifying superior configurations within practical time constraints.
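The abstract does not spell out how representative SQL subsets are selected, so the following is only an illustrative sketch of the general idea of a query-based low-fidelity proxy, not MFTune's actual algorithm. It assumes a matrix of previously observed per-query runtimes (one row per evaluated configuration) and greedily picks queries whose summed runtime correlates with the full-workload runtime while staying under a cost budget. The function name `select_proxy_queries`, the `cost_fraction` parameter, and the use of Pearson correlation are all assumptions made for illustration.

```python
import numpy as np


def select_proxy_queries(runtimes, cost_fraction=0.3):
    """Greedily pick query indices whose summed runtime tracks the full workload.

    runtimes: array of shape (n_configs, n_queries); entry [i, j] is the
        observed runtime of query j under configuration i (hypothetical input).
    cost_fraction: largest share of the mean full-workload runtime that the
        chosen subset may cost on average (hypothetical parameter).
    Returns (chosen_indices, correlation_with_full_workload).
    """
    runtimes = np.asarray(runtimes, dtype=float)
    full = runtimes.sum(axis=1)            # full-fidelity runtime per config
    budget = cost_fraction * full.mean()   # average cost allowed for the proxy
    mean_cost = runtimes.mean(axis=0)      # average cost of each query

    chosen = []
    spent = 0.0
    subset_total = np.zeros_like(full)
    best_corr = -1.0

    while True:
        best_q = None
        for q in range(runtimes.shape[1]):
            if q in chosen or spent + mean_cost[q] > budget:
                continue
            candidate = subset_total + runtimes[:, q]
            if np.std(candidate) == 0:
                continue  # a constant proxy carries no ranking information
            corr = np.corrcoef(candidate, full)[0, 1]
            if corr > best_corr:
                best_corr, best_q = corr, q
        if best_q is None:
            break  # nothing affordable improves the correlation further
        chosen.append(best_q)
        spent += mean_cost[best_q]
        subset_total += runtimes[:, best_q]

    return chosen, best_corr
```

A tuner could then evaluate most candidate configurations on the returned subset only and reserve full-workload runs for the most promising ones. Whether correlation is the right selection criterion, and how the budget should be set, depends on the workload and is left open here.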