Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users such as data engineers. Inadequate choices can vastly increase costs without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. If the memory a job requires is known in advance, the search space for an optimal resource configuration can be greatly reduced. We therefore present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and, within this reduced search space, iteratively search for the best cluster configuration using Bayesian optimization. The search stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset of 1031 Spark and Hadoop jobs, Ruya roughly halves the number of search iterations needed to find an optimal configuration compared to the baseline.
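The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not Ruya's implementation: the linear memory model, the `ClusterConfig` type, the machine labels, the profiling numbers, and the reading of "suitable" as "total memory at least the estimate" are all assumptions made for illustration. The Bayesian optimization step is omitted; in this sketch it would simply evaluate candidates in the prioritized order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster configuration (not taken from the paper)."""
    machine_type: str   # illustrative label only
    memory_gb: float    # memory per machine
    node_count: int

    @property
    def total_memory_gb(self) -> float:
        return self.memory_gb * self.node_count


def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_memory_gb(profile, full_size_gb):
    """Extrapolate peak memory from small-sample profiling runs.

    `profile` holds (sample_size_gb, peak_memory_gb) pairs measured on a
    single machine. Assumption: memory grows linearly with input size.
    """
    sizes, peaks = zip(*profile)
    slope, intercept = fit_linear(sizes, peaks)
    return slope * full_size_gb + intercept


def prioritize(configs, required_gb):
    """Put configurations with enough total memory first, keeping order."""
    enough = [c for c in configs if c.total_memory_gb >= required_gb]
    rest = [c for c in configs if c.total_memory_gb < required_gb]
    return enough + rest


# Made-up profiling measurements: (sample size in GB, peak memory in GB).
profile = [(1.0, 2.5), (2.0, 4.5), (4.0, 8.5)]
required = estimate_memory_gb(profile, full_size_gb=100.0)

configs = [
    ClusterConfig("small", 16, 4),
    ClusterConfig("medium", 32, 8),
    ClusterConfig("large", 64, 4),
    ClusterConfig("tiny", 8, 2),
]
ordered = prioritize(configs, required)
```

Configurations short on memory are moved to the back rather than discarded, so an inaccurate memory estimate only delays the search instead of excluding the true optimum.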