Towards General and Efficient Online Tuning for Spark

Spark, a distributed data analytics system, is a common choice for processing massive volumes of heterogeneous data, but tuning its parameters to achieve high performance is challenging. Recent studies apply auto-tuning techniques to this problem but suffer from three issues: limited functionality, high overhead, and inefficient search. In this paper, we present a general and efficient Spark tuning framework that addresses all three issues together. First, we introduce a generalized tuning formulation that supports multiple tuning goals and constraints, along with a Bayesian optimization (BO) based solution to the resulting optimization problem. Second, to avoid the high overhead of the additional offline evaluations required by existing methods, we propose tuning parameters during the actual periodic executions of each job (i.e., online evaluations). To ensure safety during online job executions, we design a safe configuration acquisition method that models the safe region. Finally, we use three techniques to further accelerate the search process: adaptive sub-space generation, approximate gradient descent, and meta-learning. We have implemented this framework as an independent cloud service and applied it to the data platform at Tencent. Empirical results on both public benchmarks and large-scale production tasks demonstrate its superiority in terms of practicality, generality, and efficiency. Notably, within 20 iterations, the service reduces memory cost by 57.00% and CPU cost by 34.93% on average across 25K in-production tasks.
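To make the online, constrained setting more concrete, the following is a minimal sketch of a tuning loop in which every evaluation is one periodic job run and only configurations that look safe are tried. It is not the framework's implementation: `run_job`, its cost and runtime model, the two parameters, and their bounds are hypothetical, and the nearest-neighbour safety check and random perturbation stand in for the safe-region model and the BO acquisition step described above.

```python
import random

# Hypothetical parameter bounds, for illustration only.
MEM_RANGE = (1.0, 32.0)    # executor memory in GB
CORE_RANGE = (1.0, 16.0)   # executor cores


def run_job(config):
    """Hypothetical stand-in for one periodic job execution.

    Returns (cost, runtime) for config = (mem_gb, cores).
    """
    mem_gb, cores = config
    runtime = 600.0 / cores + 400.0 / mem_gb   # more resources, shorter run
    cost = 0.5 * mem_gb + 1.0 * cores          # more resources, higher cost
    return cost, runtime


def clamp(value, bounds):
    return max(bounds[0], min(bounds[1], value))


def distance(a, b):
    # Chebyshev distance between two configurations.
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def looks_safe(candidate, history, runtime_limit):
    """Assumed safety heuristic: accept a candidate only if the nearest
    configuration observed so far met the runtime constraint."""
    nearest = min(history, key=lambda rec: distance(candidate, rec[0]))
    return nearest[2] <= runtime_limit


def best_feasible(history, runtime_limit):
    """Lowest-cost observation that satisfied the runtime constraint."""
    feasible = [rec for rec in history if rec[2] <= runtime_limit]
    return min(feasible, key=lambda rec: rec[1])


def tune_online(initial_config, runtime_limit, iterations=20, step=2.0, seed=0):
    """Minimize cost subject to runtime <= runtime_limit, one job run per step."""
    rng = random.Random(seed)
    cost, runtime = run_job(initial_config)
    if runtime > runtime_limit:
        raise ValueError("the initial configuration must satisfy the constraint")
    history = [(initial_config, cost, runtime)]

    for _ in range(iterations):
        incumbent = best_feasible(history, runtime_limit)[0]
        # Random perturbations around the incumbent (a stand-in for BO acquisition).
        candidates = [
            (clamp(incumbent[0] + rng.uniform(-step, step), MEM_RANGE),
             clamp(incumbent[1] + rng.uniform(-step, step), CORE_RANGE))
            for _ in range(30)
        ]
        safe = [c for c in candidates if looks_safe(c, history, runtime_limit)]
        # The job has to run this period anyway; fall back to the incumbent.
        chosen = rng.choice(safe) if safe else incumbent
        cost, runtime = run_job(chosen)
        history.append((chosen, cost, runtime))

    best = best_feasible(history, runtime_limit)
    return best[0], best[1], history


best_config, best_cost, history = tune_online((16.0, 8.0), runtime_limit=150.0)
print(best_config, best_cost, len(history))
```

The heuristic does not guarantee that every tried configuration meets the constraint; it only keeps proposals close to configurations that already did. Further goals or constraints could be added by extending the tuple returned by `run_job` and the feasibility checks.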
