We propose a new method for estimating the number of answers OUT of a small join query Q in a large database D, and for uniform sampling over joins. Our method is the first to satisfy all of the following properties:
- It supports arbitrary Q, which can be either acyclic or cyclic and can contain binary and non-binary relations.
- It guarantees an arbitrarily small error with high probability, always in Õ(AGM/OUT) time, where AGM is the AGM bound of OUT (an upper bound on OUT) and Õ hides a polylogarithmic factor of the input size.

We also explain previous join size estimators in a unified framework. All methods, including ours, rely on certain indexes on the relations in D, which take linear time to build offline. Additionally, we extend our method using generalized hypertree decompositions (GHDs) to achieve a lower complexity than Õ(AGM/OUT) when OUT is small, and we present optimization techniques for improving estimation efficiency and accuracy.
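To make the quantities in the abstract concrete, the following minimal sketch works through the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c). It is not the method proposed here: the estimator is a generic random-walk (inverse-probability) estimator chosen only for illustration, and all function names, the hash index layout, and the toy data are assumptions. The bound function takes the smallest of four fractional edge covers, each of which gives a valid upper bound on OUT.

```python
import math
import random
from collections import defaultdict


def agm_upper_bound_triangle(R, S, T):
    """Upper bound on OUT for Q(a,b,c) = R(a,b), S(b,c), T(a,c).

    Each fractional edge cover yields a valid bound; we take the smallest
    over the covers (1/2,1/2,1/2), (1,1,0), (1,0,1), (0,1,1).
    """
    r, s, t = len(R), len(S), len(T)
    return min(math.sqrt(r * s * t), r * s, r * t, s * t)


def build_index(S):
    """Hypothetical hash index on S (b -> list of c), built in linear time."""
    index = defaultdict(list)
    for b, c in S:
        index[b].append(c)
    return dict(index)


def exact_count(R, S, T):
    """Exact OUT by index nested-loop join (for checking the estimate)."""
    s_index, t_set = build_index(S), set(T)
    return sum(1 for a, b in R for c in s_index.get(b, ()) if (a, c) in t_set)


def estimate_count(R, S, T, trials, rng):
    """Illustrative random-walk estimator (NOT the paper's algorithm)."""
    s_index, t_set = build_index(S), set(T)
    total = 0.0
    for _ in range(trials):
        a, b = rng.choice(R)                 # sampled with probability 1/|R|
        candidates = s_index.get(b, [])
        if not candidates:
            continue                         # failed walk contributes 0
        c = rng.choice(candidates)           # probability 1/len(candidates)
        if (a, c) in t_set:
            # Weight a successful walk by its inverse sampling probability.
            total += len(R) * len(candidates)
    return total / trials


if __name__ == "__main__":
    n = 6
    # Toy data: every ordered pair of distinct values in all three relations.
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    print("upper bound:", agm_upper_bound_triangle(edges, edges, edges))
    print("exact OUT:  ", exact_count(edges, edges, edges))
    print("estimate:   ", estimate_count(edges, edges, edges, 20000, random.Random(0)))
```

Each answer (a,b,c) is reached with probability 1/|R| times 1/deg(b) and is weighted by the inverse of that probability, so the average is an unbiased estimate of OUT. This simple estimator does not carry the Õ(AGM/OUT) guarantee described above; it only shows how an index-backed sampler, OUT, and an AGM-style bound relate on a small instance.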