Algorithms that operate on vertically distributed datasets are increasingly important as large-scale, decentralized data storage becomes more prevalent. By processing data locally, such algorithms enhance privacy, reduce the need for data transfer, and limit exposure to breaches. They also scale better, since they can handle vast amounts of data spread across multiple locations without requiring centralized access. Top-k queries have been studied extensively in this setting and are particularly well suited to applications in healthcare, finance, and IoT, where data is often sensitive and distributed across various sources. Classical top-k algorithms assume two kinds of access to each source: sorted access, i.e., a sequential scan of the dataset in its internal sort order, one tuple at a time; and random access, which returns all the information available at a data source for a tuple whose id is known. However, when data retrieval costs are high, when data is streamed in real time, or simply when data comes from external sources that only offer sorted access, random access may be impractical or impossible because of latency issues or access constraints. Fortunately, there is a long tradition of algorithms designed for the "no random access" (NRA) scenario for classical top-k queries. Yet these algorithms do not cover recent advances in ranking queries, which propose hybridizations of top-k queries (preference-aware, with controlled output size) and skyline queries (preference-agnostic, with uncontrolled output size). The non-dominated flexible skyline (ND) is one such proposal. We introduce an algorithm for computing ND in the NRA scenario, prove its correctness and optimality within its class, and provide an experimental evaluation covering a wide range of cases, with both synthetic and real datasets.
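To make the "no random access" setting concrete, the following is a minimal Python sketch of a classical NRA-style top-k computation that uses sorted access only. It is not the ND algorithm introduced in this work. It assumes, purely for illustration, that each source is a list of (id, score) pairs sorted by descending score, that scores lie in [0, 1], that every tuple appears in every source, and that the overall score is the sum of the per-source scores. The names `nra_top_k`, `Source`, `s1`, and `s2` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a source is a list of (tuple id, score) pairs, sorted by
# descending score, with scores in [0, 1]. Only sequential scans are used.
Source = List[Tuple[str, float]]


def nra_top_k(sources: List[Source], k: int) -> Tuple[List[str], int]:
    """Return (top-k ids by summed score, number of sorted-access rounds)."""
    m = len(sources)
    seen: Dict[str, Dict[int, float]] = {}  # id -> {source index: score}
    last = [1.0] * m                        # last score read from each source
    max_depth = max(len(src) for src in sources)
    depth = 0

    while depth < max_depth:
        # One round of sorted access: read the next tuple from every source.
        for i, src in enumerate(sources):
            if depth < len(src):
                tid, score = src[depth]
                seen.setdefault(tid, {})[i] = score
                last[i] = score
            else:
                last[i] = 0.0  # exhausted source contributes nothing more
        depth += 1

        if len(seen) < k:
            continue

        # Lower bound: unknown scores count as 0.
        lower = {t: sum(p.values()) for t, p in seen.items()}
        # Upper bound: unknown scores are at most the last score read there.
        upper = {
            t: sum(p.get(i, last[i]) for i in range(m))
            for t, p in seen.items()
        }

        top = sorted(seen, key=lambda t: lower[t], reverse=True)[:k]
        kth_lower = lower[top[-1]]
        unseen_upper = sum(last)  # best score of a tuple not seen anywhere
        others_upper = max(
            (upper[t] for t in seen if t not in top),
            default=float("-inf"),
        )

        # Stop once no other tuple can overtake the current top-k.
        if kth_lower >= max(unseen_upper, others_upper):
            return top, depth

    # All sources exhausted: every score is known, so lower bounds are exact.
    exact = {t: sum(p.values()) for t, p in seen.items()}
    return sorted(seen, key=lambda t: exact[t], reverse=True)[:k], depth


# Hypothetical example data: two sources holding one attribute each.
s1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
s2 = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
print(nra_top_k([s1, s2], 2))
```

On this toy input the scan stops after two rounds, before tuples `c` and `d` are ever read: the score bounds alone certify the result, so no random access is needed to complete partially seen tuples.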