Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art ``propose-and-verify'' framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size. In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.