While classical skyline queries identify interesting data within large datasets, flexible skylines introduce preferences through constraints on attribute weights and thereby further reduce the data returned. However, computing these queries can be time-consuming for large datasets. We propose and implement a parallel computation scheme consisting of a parallel phase followed by a sequential phase, and apply it to flexible skylines. We also assess the effect of adding an initial filtering phase that reduces the dataset size before parallel processing, and of eliminating the sequential phase, which is the most time-consuming part, altogether. All experiments are executed in the PySpark framework on a number of datasets of varying sizes and dimensions.
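To make the two-phase scheme concrete, the following is a minimal sketch under stated assumptions, not the implementation evaluated here. It uses plain Python with a thread pool standing in for PySpark workers, and classical Pareto dominance (lower values are better) standing in for the dominance test of flexible skylines, which would additionally take the weight constraints into account. The names `dominates`, `skyline`, and `parallel_skyline` are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def dominates(a, b):
    """Classical dominance (assumption: lower is better on every attribute).

    a dominates b if a is no worse on all attributes and strictly better on one.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points not dominated by any other point (window-based scan)."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, discard it
        # p survives: drop window entries that p dominates, then keep p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


def parallel_skyline(points, num_partitions=4):
    """Two-phase scheme: local skylines in parallel, then one sequential merge."""
    # Split the data into partitions (round-robin, an arbitrary choice here).
    parts = [points[i::num_partitions] for i in range(num_partitions)]

    # Parallel phase: each worker computes the skyline of its own partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        local_skylines = list(pool.map(skyline, parts))

    # Sequential phase: merge the local results and remove points that are
    # dominated by a point from another partition.
    candidates = [p for part in local_skylines for p in part]
    return skyline(candidates)


data = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 5), (6, 2), (7, 6), (9, 1), (8, 8)]
result = sorted(parallel_skyline(data, num_partitions=3))
print(result)
```

The sequential merge is needed because a point can survive in its own partition and still be dominated by a point from another one. In a Spark setting, the parallel phase would roughly correspond to a per-partition operation such as `mapPartitions` on an RDD, with the merge running on the collected local results.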