Estimating the cardinality of a query's output is a fundamental problem in database query processing. In this article, we give an overview of a recently published contribution that casts cardinality estimation as a linear optimization problem and computes guaranteed upper bounds on the output cardinality of any full conjunctive query. The objective of the linear program is to maximize the joint entropy of the query variables; its constraints are the Shannon information inequalities together with new information inequalities involving $\ell_p$-norms of the degree sequences of the join attributes. The bounds based on arbitrary norms can be asymptotically lower than those based on the $\ell_1$ and $\ell_\infty$ norms alone, which capture the cardinalities and the maximum degrees of the input relations, respectively. The bounds come with a matching query evaluation algorithm, are computable in time exponential in the query size, and are provably tight when each degree sequence is on one join attribute.
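To make the role of the norms concrete, the following minimal sketch compares three upper bounds on the size of a single two-way join $R(x,y) \bowtie S(y,z)$. It is an illustration only, not the linear program from the overviewed work: it assumes the simplest possible query and uses the Cauchy-Schwarz inequality to obtain the $\ell_2$-based bound $\|\deg_R\|_2 \cdot \|\deg_S\|_2$. The relations `R` and `S`, and the helper names `degree_sequence`, `lp_norm`, and `join_bounds`, are hypothetical and chosen for this example.

```python
from collections import Counter

INF = float("inf")


def degree_sequence(relation, key_index):
    """Count how often each join-attribute value occurs in the relation."""
    return Counter(row[key_index] for row in relation)


def lp_norm(degrees, p):
    """l_p norm of a degree sequence; p = INF yields the maximum degree."""
    values = list(degrees.values())
    if not values:
        return 0.0
    if p == INF:
        return float(max(values))
    return sum(d ** p for d in values) ** (1.0 / p)


def join_bounds(r, s):
    """Exact size of R(x, y) JOIN S(y, z) and three norm-based upper bounds."""
    deg_r = degree_sequence(r, 1)  # y is the second column of R
    deg_s = degree_sequence(s, 0)  # y is the first column of S
    # Exact output size: sum over y of deg_R(y) * deg_S(y).
    actual = sum(d * deg_s.get(y, 0) for y, d in deg_r.items())
    return {
        "actual": actual,
        # Cardinalities only: |R| * |S|.
        "l1_l1": lp_norm(deg_r, 1) * lp_norm(deg_s, 1),
        # Cardinality times max degree, taking the better of the two sides.
        "l1_linf": min(lp_norm(deg_r, 1) * lp_norm(deg_s, INF),
                       lp_norm(deg_r, INF) * lp_norm(deg_s, 1)),
        # Cauchy-Schwarz: sum_y a_y * b_y <= ||a||_2 * ||b||_2.
        "l2_l2": lp_norm(deg_r, 2) * lp_norm(deg_s, 2),
    }


# Skewed toy data: y = 0 has degree 4, and y = 1..4 have degree 1, in both relations.
R = [(x, 0) for x in range(4)] + [(10 + y, y) for y in range(1, 5)]
S = [(0, z) for z in range(4)] + [(y, 100 + y) for y in range(1, 5)]

bounds = join_bounds(R, S)
for name, value in bounds.items():
    print(f"{name}: {value:.2f}")
```

On this skewed instance the $\ell_2$-based bound coincides with the true output size up to floating-point rounding, while the bounds that use only cardinalities and maximum degrees are looser. This mirrors, in a very small setting, the claim that norms other than $\ell_1$ and $\ell_\infty$ can yield lower bounds.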