We study the complexity of evaluating queries on probabilistic databases under bag semantics. We focus on self-join free conjunctive queries, and probabilistic databases where occurrences of different facts are independent, which is the natural generalization of tuple-independent probabilistic databases to the bag semantics setting. For set semantics, the data complexity of this problem is well understood, even for the more general class of unions of conjunctive queries: it is either in polynomial time, or #P-hard, depending on the query (Dalvi & Suciu, JACM 2012). A reasonably general model of bag probabilistic databases may have unbounded multiplicities. In this case, the probabilistic database is no longer finite, and a careful treatment of representation mechanisms is required. Moreover, the answer to a Boolean query is a probability distribution over (possibly all) non-negative integers, rather than a probability distribution over { true, false }. Therefore, we discuss two flavors of probabilistic query evaluation: computing expectations of answer tuple multiplicities, and computing the probability that a tuple is contained in the answer at most k times for some parameter k. Subject to mild technical assumptions on the representation systems, it turns out that expectations are easy to compute, even for unions of conjunctive queries. For query answer probabilities, we obtain a dichotomy between solvability in polynomial time and #P-hardness for self-join free conjunctive queries.
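The claim that expectations are easy to compute can be illustrated with a minimal sketch, which is not the construction used in the paper. It assumes a hypothetical self-join-free Boolean query Q() :- R(x), S(x, y) and fact multiplicities with finite support, so that a brute-force check is possible; the names `expected_by_linearity`, `expected_by_enumeration`, `r`, and `s` are illustrative only. Under bag semantics the answer multiplicity is the sum over all (a, b) of R(a) · S(a, b). Because the multiplicities of different facts are independent, the expectation of each product is the product of the expectations, and linearity of expectation handles the sum.

```python
from fractions import Fraction
from itertools import product


def mean(dist):
    """Expected multiplicity of one fact; dist maps multiplicity -> probability."""
    return sum(m * p for m, p in dist.items())


def expected_by_linearity(r, s):
    """E[Q] for Q() :- R(x), S(x, y), using independence of distinct facts."""
    return sum(mean(r[a]) * mean(d) for (a, _b), d in s.items() if a in r)


def expected_by_enumeration(r, s):
    """Same expectation by summing over all possible worlds (exponential time)."""
    facts = [("R", k, d) for k, d in r.items()] + [("S", k, d) for k, d in s.items()]
    total = Fraction(0)
    for choice in product(*(d.items() for _, _, d in facts)):
        prob = Fraction(1)
        world = {}
        for (rel, key, _), (m, p) in zip(facts, choice):
            prob *= p
            world[(rel, key)] = m
        # Bag-semantics answer multiplicity of Q in this world.
        count = sum(world[("R", a)] * world[("S", (a, b))] for (a, b) in s if a in r)
        total += prob * count
    return total


# Hypothetical instance: each fact has an independent multiplicity distribution.
r = {
    "a": {0: Fraction(1, 2), 1: Fraction(1, 2)},
    "b": {0: Fraction(1, 4), 2: Fraction(3, 4)},
}
s = {
    ("a", "c"): {1: Fraction(1, 3), 3: Fraction(2, 3)},
    ("b", "c"): {0: Fraction(1, 2), 1: Fraction(1, 2)},
}

print(expected_by_linearity(r, s))    # 23/12
print(expected_by_enumeration(r, s))  # 23/12
```

The linearity-based function only needs the mean multiplicity of each fact, which suggests why expectations can remain tractable even when multiplicities are unbounded, provided the representation system makes those means available. The enumeration is included purely as a sanity check and does not scale.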