Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, and the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) history tasks should be utilized carefully, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.