In a large database system, upper-bounding the cardinality of a join query is a crucial task called $\textit{pessimistic cardinality estimation}$. Recently, Abo Khamis, Nakos, Olteanu, and Suciu unified related works into the following dexterous framework. Step 1: Let $(X_1, \dotsc, X_n)$ be a uniformly random row of the join, so that $H(X_1, \dotsc, X_n)$ equals the log of the join cardinality. Step 2: Upper-bound $H(X_1, \dotsc, X_n)$ using Shannon-type inequalities such as $H(X, Y, Z) \le H(X) + H(Y|X) + H(Z|Y)$. Step 3: Upper-bound $H(X_i) + p H(X_j | X_i)$ using the $p$-norm of the degree sequence of the underlying graph of a relation. While the old bounds in Step 3 count "claws $\in$" in the underlying graph, we propose $\textit{ambidextrous}$ bounds that count "claw pairs ${\ni}\!{-}\!{\in}$". The new bounds are provably no looser and empirically tighter: they overestimate by a factor of $x^{3/4}$ when the old bounds overestimate by a factor of $x$. For example, when counting friend triples in the $\texttt{com-Youtube}$ dataset, the best dexterous bound is $1.2 \cdot 10^9$, the best ambidextrous bound is $5.1 \cdot 10^8$, and the actual cardinality is $1.8 \cdot 10^7$.
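As a rough illustration of Steps 1 to 3, the sketch below applies the older (dexterous) degree-sequence bounds to a toy two-relation join $R(X, Y) \bowtie S(Y, Z)$. The query, the toy data, and the names `sum_of_degree_powers` and `two_path_bounds` are assumptions chosen for illustration and do not come from the paper; the ambidextrous bounds themselves are not implemented here. The sketch relies on the standard reading of Step 3, namely $H(X_i) + p H(X_j | X_i) \le \log \sum_v \deg(v)^p$.

```python
import math
from collections import Counter


def sum_of_degree_powers(degrees, p):
    """Return sum_v deg(v)^p, the p-th power of the p-norm of a degree sequence."""
    return sum(d ** p for d in degrees)


def two_path_bounds(R, S):
    """Bound |R(X, Y) JOIN S(Y, Z)| following Steps 1-3 (hypothetical helper).

    R and S are lists of distinct pairs; the join attribute Y is the second
    column of R and the first column of S. Assumes the join is non-empty.
    Returns (actual join size, log2 of the l2 bound, log2 of the l1/l-inf bound).
    """
    deg_r = Counter(y for _, y in R)  # number of X-partners of each y in R
    deg_s = Counter(y for y, _ in S)  # number of Z-partners of each y in S

    # Step 1: for a uniformly random output row, H(X, Y, Z) = log2(actual).
    actual = sum(deg_r[y] * deg_s[y] for y in deg_r)

    # Step 2 (Shannon): H(X, Y, Z) <= H(Y) + H(X|Y) + H(Z|Y)
    #                   = 1/2 [H(Y) + 2 H(X|Y)] + 1/2 [H(Y) + 2 H(Z|Y)].
    # Step 3 with p = 2: H(Y) + 2 H(X|Y) <= log2 sum_y deg_r(y)^2, same for S.
    log_l2 = 0.5 * (
        math.log2(sum_of_degree_powers(deg_r.values(), 2))
        + math.log2(sum_of_degree_powers(deg_s.values(), 2))
    )

    # Alternative chain: H(X, Y, Z) <= H(X, Y) + H(Z|Y)
    #                    <= log2 |R| + log2 max_y deg_s(y)   (p = 1 and p = infinity).
    log_l1_linf = math.log2(len(R)) + math.log2(max(deg_s.values()))

    return actual, log_l2, log_l1_linf


# Toy relations (made up for this example).
R = [(1, 0), (2, 0), (3, 0), (4, 1)]
S = [(0, 5), (1, 6), (1, 7)]
actual, log_l2, log_l1_linf = two_path_bounds(R, S)
print(actual, 2 ** log_l2, 2 ** log_l1_linf)
```

On this toy input the join has 5 rows, the $p = 2$ bound is $\sqrt{10 \cdot 5} \approx 7.07$, and the $|R| \cdot \max \deg$ bound is 8, so both bounds hold and the $\ell_2$ bound is the tighter of the two.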