Estimating the output size of a join query is a fundamental and longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true join size by orders of magnitude, which leads to significant performance penalties. Recently, size upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet they grossly overestimate the true join size. This paper puts forward a general class of size bounds that are based on information inequalities involving Lp-norms on the degree sequences of the join columns. They generalise prior efforts and can be asymptotically tighter than the known bounds. We give two types of lower and upper bounds: some hold for all entropic vectors, while others hold for all polymatroids. The former are asymptotically tight but possibly not computable; the latter are computable but not even asymptotically tight. When all degree constraints are over a single variable, we call them "simple". For simple degree constraints we prove that the polymatroid and entropic bounds are equal, that they are tight up to a query-dependent constant (which is stronger than asymptotically tight), that they are computable in time exponential in the size of the query, and that the worst-case database instance that matches the bound has a simple structure called a "normal database".
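To make the role of Lp-norms on degree sequences concrete, the following is a minimal sketch for the simplest case only: a two-relation join R(X, Y) ⋈ S(Y, Z) on the single column Y. It is not the paper's general bound. It relies only on Hölder's inequality, which gives |R ⋈ S| = Σ_y deg_R(y)·deg_S(y) ≤ ‖deg_R‖_p · ‖deg_S‖_q whenever 1/p + 1/q = 1. The helper names (`degree_sequence`, `lp_norm`, `join_size`) and the toy relations are assumptions made for this illustration.

```python
from collections import Counter

INF = float("inf")

def degree_sequence(rel, col):
    """Degrees of the distinct values in column `col`, sorted in descending order."""
    return sorted(Counter(t[col] for t in rel).values(), reverse=True)

def lp_norm(degs, p):
    """Lp-norm of a degree sequence; p = INF gives the max-degree."""
    if p == INF:
        return float(max(degs))
    return sum(d ** p for d in degs) ** (1.0 / p)

def join_size(R, S):
    """Exact size of R(X, Y) joined with S(Y, Z) on Y."""
    deg_r = Counter(y for _, y in R)
    deg_s = Counter(y for y, _ in S)
    return sum(deg_r[y] * deg_s[y] for y in deg_r)

# Toy relations (illustrative data, not taken from the paper).
R = [(1, "a"), (2, "a"), (3, "a"), (4, "b")]
S = [("a", 10), ("a", 11), ("b", 12), ("c", 13)]

dR = degree_sequence(R, 1)  # [3, 1]
dS = degree_sequence(S, 0)  # [2, 1, 1]

true_size = join_size(R, S)                           # 3*2 + 1*1 = 7
bound_sizes = len(R) * len(S)                         # cardinalities only: 16
bound_maxdeg = lp_norm(dR, 1) * lp_norm(dS, INF)      # |R| * max-degree of S: 8.0
bound_l2 = lp_norm(dR, 2) * lp_norm(dS, 2)            # sqrt(10) * sqrt(6), about 7.75

print(true_size, bound_sizes, bound_maxdeg, round(bound_l2, 2))
```

On this toy instance the L2-based bound (about 7.75) is closer to the true size (7) than the bound using only a relation size and a max-degree (8), which in turn is closer than the product of the relation sizes (16). Which pair of norms gives the smallest bound depends on the data, which is one reason to consider several norms together.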