Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work on incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating the results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is, to the best of our knowledge, the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluate our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors and often produces more accurate results.