MFTune: An Efficient Multi-fidelity Framework for Spark SQL Configuration Tuning

Apache Spark SQL is a cornerstone of modern big data analytics. However, optimizing Spark SQL performance is challenging due to its vast configuration space and the prohibitive cost of evaluating massive workloads. Existing tuning methods predominantly rely on full-fidelity evaluations, which are extremely time-consuming and often lead to suboptimal performance within practical budgets. While multi-fidelity optimization offers a potential solution, directly applying standard techniques, such as data volume reduction or early stopping, proves ineffective for Spark SQL because they fail to preserve performance correlations or to represent true system bottlenecks. To address these challenges, we propose MFTune, an efficient multi-fidelity framework that introduces a query-based fidelity partitioning strategy, which uses representative SQL subsets as accurate, low-cost proxies for the full workload. To navigate the huge search space, MFTune incorporates a density-based optimization mechanism for automated knob and range compression, alongside an adapted transfer learning approach and a two-phase warm start to further accelerate the tuning process. Experimental results on the TPC-H and TPC-DS benchmarks demonstrate that MFTune significantly outperforms five state-of-the-art tuning methods, identifying superior configurations within practical time constraints.
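To make the idea of a query-based low-fidelity proxy concrete, the following is a minimal sketch and not MFTune's actual algorithm, which the abstract does not specify. It assumes that per-query runtimes have already been observed under a few configurations, and it greedily picks the queries whose summed runtime best preserves the ranking of configurations given by the full workload (measured with Spearman rank correlation). The function names, the greedy criterion, and the runtime numbers are all hypothetical and chosen only for illustration.

```python
def ranks(values):
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied positions i..j
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def select_proxy_queries(runtimes, k):
    """Greedily choose up to k queries whose summed runtime tracks the full workload.

    runtimes maps a query name to a list of runtimes, one per configuration.
    Returns the chosen query names and the rank correlation of the proxy.
    """
    queries = list(runtimes)
    n_cfg = len(next(iter(runtimes.values())))
    full = [sum(runtimes[q][c] for q in queries) for c in range(n_cfg)]
    chosen, partial = [], [0.0] * n_cfg
    for _ in range(min(k, len(queries))):
        best_q, best_rho = None, -2.0
        for q in queries:
            if q in chosen:
                continue
            candidate = [p + t for p, t in zip(partial, runtimes[q])]
            rho = spearman(candidate, full)
            if rho > best_rho:  # first query wins on ties
                best_q, best_rho = q, rho
        chosen.append(best_q)
        partial = [p + t for p, t in zip(partial, runtimes[best_q])]
    return chosen, spearman(partial, full)


# Made-up runtimes (seconds) for four queries under four configurations.
example_runtimes = {
    "q1": [10, 12, 30, 8],
    "q2": [5, 5, 5, 5],    # insensitive to the configuration
    "q3": [20, 25, 60, 15],
    "q4": [3, 9, 2, 7],    # varies, but not in line with the total
}

proxy, rho = select_proxy_queries(example_runtimes, 1)
```

A subset chosen this way is only as trustworthy as the observations behind it, so a real tuner would presumably re-check the correlation as new full-fidelity evaluations arrive.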

Related content

Spark

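Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by the AMPLab at UC Berkeley. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it can keep a job's intermediate output in memory, so it no longer needs to read and write HDFS between stages. As a result, Spark is better suited to iterative algorithms such as those used in data mining and machine learning.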