Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is, to the best of our knowledge, the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors and often produces more accurate results.
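To make the idea of under- and over-approximating a top-k result concrete, the following is a minimal sketch, not the algorithm or data model of the paper. It assumes, purely for illustration, that each tuple's score is known only as an interval `[lo, hi]` and that tuples are ranked by descending score. All names (`UncertainTuple`, `position_bounds`, `approximate_top_k`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainTuple:
    name: str
    lo: float  # smallest score the tuple may have (assumed interval model)
    hi: float  # largest score the tuple may have


def position_bounds(t, relation):
    """Bound the 0-based rank of t when sorting by descending score."""
    others = [s for s in relation if s is not t]
    # s certainly precedes t if even its lowest score beats t's highest.
    earliest = sum(1 for s in others if s.lo > t.hi)
    # s possibly precedes t if its highest score can reach t's lowest
    # (ties are counted, since tie-breaking is unknown).
    latest = sum(1 for s in others if s.hi >= t.lo)
    return earliest, latest


def approximate_top_k(relation, k):
    """Return (certain, possible): an under- and an over-approximation."""
    certain, possible = set(), set()
    for t in relation:
        earliest, latest = position_bounds(t, relation)
        if latest < k:    # in the top-k no matter how the scores resolve
            certain.add(t.name)
        if earliest < k:  # in the top-k for at least some resolution
            possible.add(t.name)
    return certain, possible


relation = [
    UncertainTuple("a", 9.0, 10.0),
    UncertainTuple("b", 5.0, 8.0),
    UncertainTuple("c", 6.0, 7.0),
    UncertainTuple("d", 1.0, 2.0),
]
certain, possible = approximate_top_k(relation, k=2)
```

For this toy input with `k=2`, `certain` is `{"a"}` and `possible` is `{"a", "b", "c"}`: `a` always ranks first, `d` can never reach the top two, and `b` and `c` compete for the second position. The quadratic pairwise comparison is chosen for clarity only; it makes no claim about how the physical operators described above are implemented.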