Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of these upper bounds by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory; they come with a matching query evaluation algorithm, are computable in time exponential in the query size, and are provably tight when all degrees are "simple".
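To make the role of $\ell_p$-norms concrete, the following is a minimal sketch for the simplest acyclic case, a single join $R(x, y) \bowtie S(y, z)$. It is not the general bound of the paper: it only uses the Cauchy-Schwarz inequality, which gives $|R \bowtie S| = \sum_y \deg_R(y)\,\deg_S(y) \le \|\deg_R\|_2 \cdot \|\deg_S\|_2$, and compares this with the familiar size-times-max-degree bound. The relations `R` and `S` and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def degree_sequence(relation, attr_index):
    """Degrees of the values of one attribute, sorted in decreasing order."""
    counts = Counter(row[attr_index] for row in relation)
    return sorted(counts.values(), reverse=True)


def lp_norm(degrees, p):
    """l_p-norm of a degree sequence; p = math.inf gives the max-degree."""
    if p == math.inf:
        return float(max(degrees, default=0))
    return sum(d ** p for d in degrees) ** (1.0 / p)


def join_size(r, s):
    """Exact size of R(x, y) JOIN S(y, z), joining on y."""
    deg_r = Counter(y for _, y in r)
    deg_s = Counter(y for y, _ in s)
    return sum(deg_r[y] * deg_s[y] for y in deg_r)


# Hypothetical skewed inputs: in R, value y has degree y + 1; in S, degree 4 - y.
R = [(x, y) for y in range(4) for x in range(y + 1)]
S = [(y, z) for y in range(4) for z in range(4 - y)]

deg_R = degree_sequence(R, 1)  # degrees of y in R
deg_S = degree_sequence(S, 0)  # degrees of y in S

true_size = join_size(R, S)

# Bound from relation sizes and max-degrees only (l_1 and l_infinity norms).
max_degree_bound = min(
    len(R) * lp_norm(deg_S, math.inf),
    len(S) * lp_norm(deg_R, math.inf),
)

# Bound from l_2-norms of both degree sequences (Cauchy-Schwarz).
l2_bound = lp_norm(deg_R, 2) * lp_norm(deg_S, 2)

print(true_size, l2_bound, max_degree_bound)
```

On this toy input the true join size is 20, the $\ell_2$ bound is about 30, and the max-degree bound is 40, so the $\ell_2$ bound is strictly tighter even though the query is acyclic. Note that $\|\deg\|_1$ is the relation size and $\|\deg\|_\infty$ is the max-degree, so bounds built from sizes and max-degrees are the special case that uses only $p = 1$ and $p = \infty$.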