Skyline computation identifies optimal data points under multiple location-based criteria. However, it becomes less efficient and harder to carry out as the input data grows. This study presents a novel algorithm that addresses this challenge by using Apache Spark, a distributed processing platform, for area skyline computation, improving both processing speed and scalability. The algorithm comprises three phases: computing distances between data points, generating distance tuples, and executing the skyline operators. In the second phase, a local partial skyline extraction technique minimizes the volume of data transmitted from each executor (a parallel processing procedure) to the driver (a central processing procedure). The driver then processes the received data to determine the final skyline and creates filters to exclude irrelevant points. Extensive experiments on eight datasets show that our algorithm significantly reduces both the data size and the computation time required for area skyline computation.
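The three phases can be illustrated with a minimal single-process sketch. It is not the paper's implementation: plain Python lists stand in for Spark partitions, and the function names (`dominates`, `skyline`, `distance_tuple`, `area_skyline`) are hypothetical. The sketch also assumes that each candidate location is scored by its Euclidean distance to the nearest facility of each type and that smaller distances are preferable. The driver-side filters mentioned above are omitted.

```python
import math
from typing import List, Sequence, Tuple

Location = Tuple[float, float]
# A scored item pairs a distance tuple with the candidate location it describes.
Scored = Tuple[Tuple[float, ...], Location]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every dimension and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(items: Sequence[Scored]) -> List[Scored]:
    """Keep the items whose distance tuple is not dominated by any other."""
    return [p for p in items if not any(dominates(q[0], p[0]) for q in items)]


def distance_tuple(loc: Location,
                   facility_groups: Sequence[Sequence[Location]]) -> Tuple[float, ...]:
    """Phases 1 and 2: distance to the nearest facility of each type (assumed scoring)."""
    return tuple(min(math.dist(loc, f) for f in group) for group in facility_groups)


def area_skyline(candidates: Sequence[Location],
                 facility_groups: Sequence[Sequence[Location]],
                 num_partitions: int = 2) -> List[Scored]:
    # Simulated executors: each partition extracts its own local partial skyline,
    # so only non-dominated tuples are "sent" to the driver.
    partitions = [candidates[i::num_partitions] for i in range(num_partitions)]
    collected: List[Scored] = []
    for part in partitions:
        scored = [(distance_tuple(c, facility_groups), c) for c in part]
        collected.extend(skyline(scored))
    # Simulated driver (phase 3): merge the partial skylines into the final one.
    return skyline(collected)


candidates = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
facility_groups = [[(0.0, 0.0)], [(2.0, 2.0)]]  # two facility types, one site each
result = sorted(loc for _, loc in area_skyline(candidates, facility_groups))
print(result)
```

Merging local skylines yields the same result as a single global pass, because a point that is dominated inside its own partition is also dominated in the full dataset. This property is what allows the executors to discard points before any data reaches the driver.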