Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art ``propose-and-verify'' framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size. In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.
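To make the sampling problem concrete, the following is a minimal sketch of a textbook rejection sampler for the matrix query $\pi_{A,C}(R(A,B) \bowtie S(B,C))$. It is not the algorithm of this paper or of \cite{chen2020random}; the function name `sample_matrix_query`, the tuple-list input format, and the `max_tries` cutoff are assumptions made purely for illustration. The sketch draws a uniform tuple from the full join, projects it onto $(a, c)$, and accepts with probability $1/w(a,c)$, where $w(a,c)$ is the number of join values $b$ witnessing the pair, so every distinct output pair is returned with equal probability.

```python
import random
from collections import defaultdict


def sample_matrix_query(R, S, rng, max_tries=10_000):
    """Return a uniform sample (a, c) from pi_{A,C}(R(A,B) JOIN S(B,C)).

    R is an iterable of (a, b) pairs, S an iterable of (b, c) pairs.
    Returns None if the join is empty or max_tries (a hypothetical
    safety cutoff) is exhausted.
    """
    a_by_b, c_by_b = defaultdict(list), defaultdict(list)
    b_by_a, b_by_c = defaultdict(set), defaultdict(set)
    for a, b in sorted(set(R)):  # deduplicate: relations are sets
        a_by_b[b].append(a)
        b_by_a[a].add(b)
    for b, c in sorted(set(S)):
        c_by_b[b].append(c)
        b_by_c[c].add(b)

    # Join values b that appear in both relations, weighted by the
    # number of full-join tuples (a, b, c) each one produces.
    join_keys = [b for b in a_by_b if b in c_by_b]
    if not join_keys:
        return None
    weights = [len(a_by_b[b]) * len(c_by_b[b]) for b in join_keys]

    for _ in range(max_tries):
        # Uniform tuple (a, b, c) of the full join.
        b = rng.choices(join_keys, weights=weights, k=1)[0]
        a = rng.choice(a_by_b[b])
        c = rng.choice(c_by_b[b])
        # (a, c) is proposed with probability proportional to its
        # number of witnesses; accepting with 1/witnesses cancels that.
        witnesses = len(b_by_a[a] & b_by_c[c])
        if rng.random() < 1.0 / witnesses:
            return (a, c)
    return None


if __name__ == "__main__":
    R = [(1, 10), (1, 11), (2, 10)]
    S = [(10, "x"), (11, "x"), (10, "y")]
    print(sample_matrix_query(R, S, random.Random(0)))
```

Each attempt of this baseline succeeds with probability equal to the projected output size divided by the full-join size, so its expected number of attempts grows as projection shrinks the output. The sketch also preprocesses both relations in full, so it illustrates the sampling target only and makes no attempt at sublinear running time.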