Distributed data processing platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in cloud environments. These platforms distribute the storage and processing of data among the computing nodes of a cloud. Using them efficiently requires users to (i) configure the cloud, i.e., determine the number and type of computing nodes, and (ii) tune the configuration parameters of the platform (e.g., the data replication factor). Both tasks require in-depth knowledge of the cloud infrastructure and of distributed data processing platforms. In this paper, we therefore first study the relationship between the configuration of the cloud and the configuration of distributed data processing platforms to determine how cloud configuration impacts platform configuration. Building on these findings, we propose a co-tuning approach that recommends an optimal co-configuration of the cloud and the distributed data processing platform. The approach uses machine learning and optimization techniques to maximize the performance of the distributed data processing system deployed on the cloud. We evaluated our approach for Hadoop, Spark, and Flink in a cluster deployed on an OpenStack cloud, using three benchmarking workloads (WordCount, Sort, and K-means). Our results reveal that, in comparison to default settings, our co-tuning approach reduces execution time by 17.5% and monetary cost by 14.9% solely via configuration tuning.
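To make the idea of co-configuration concrete, the following is a minimal Python sketch of a joint search over cloud and platform settings. It is not the method of the paper: the search space, the price, and the `predicted_runtime` function are all hypothetical, and the hand-written runtime formula merely stands in for a learned performance model. The exhaustive grid search likewise stands in for whatever optimization technique is actually used.

```python
from itertools import product

# Hypothetical search space; none of these values come from the paper.
CLOUD_SPACE = {"nodes": [2, 4, 8], "vcpus_per_node": [2, 4]}
PLATFORM_SPACE = {"replication": [1, 2, 3], "parallelism": [4, 8, 16, 32]}
PRICE_PER_VCPU_HOUR = 0.05  # assumed price in dollars


def predicted_runtime(nodes, vcpus_per_node, replication, parallelism):
    """Toy stand-in for a learned performance model; returns seconds."""
    slots = nodes * vcpus_per_node
    work = 3600.0  # assumed total CPU-seconds of work
    # Useful parallelism is capped by the number of available slots.
    compute = work / min(parallelism, slots)
    # Requesting more tasks than slots adds scheduling overhead.
    overhead = 5.0 * max(0, parallelism - slots)
    # Each extra replica adds write traffic.
    replication_cost = 20.0 * (replication - 1)
    return compute + overhead + replication_cost


def dollar_cost(nodes, vcpus_per_node, runtime_s):
    """Cost of renting all vCPUs for the duration of the job."""
    return nodes * vcpus_per_node * PRICE_PER_VCPU_HOUR * runtime_s / 3600.0


def co_tune(cloud_space=CLOUD_SPACE, platform_space=PLATFORM_SPACE):
    """Search the joint space exhaustively; return the fastest configuration."""
    keys = list(cloud_space) + list(platform_space)
    best = None
    for values in product(*cloud_space.values(), *platform_space.values()):
        config = dict(zip(keys, values))
        runtime = predicted_runtime(**config)
        if best is None or runtime < best[1]:
            best = (config, runtime)
    config, runtime = best
    cost = dollar_cost(config["nodes"], config["vcpus_per_node"], runtime)
    return config, runtime, cost


if __name__ == "__main__":
    print(co_tune())
    # Fixing the cloud to a small cluster changes the best platform setting.
    print(co_tune(cloud_space={"nodes": [2], "vcpus_per_node": [2]}))
```

In this toy model the best `parallelism` depends on how many slots the cloud provides, which is the kind of interaction between cloud and platform configuration that motivates tuning the two together rather than separately.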