Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. Despite decades of literature on such sampling methods, this observation seems to have been overlooked. We further develop the finding by presenting and analyzing two alternative sampling-based inner product sketching methods. In contrast to the computationally expensive algorithm of Bessa et al., our methods compute the sketch in linear time and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While our methods are based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for them.
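To make the idea of coordinated threshold sampling concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes sparse vectors stored as Python dicts, sampling probabilities proportional to squared entry magnitude, and a hash shared across vectors so that the same indices tend to be kept in both sketches. The names `shared_uniform`, `threshold_sketch`, and `estimate_inner_product` are hypothetical, as are the parameters `m` (target expected sketch size) and `seed`.

```python
import hashlib


def shared_uniform(index, seed=0):
    """Map an index to a pseudo-uniform value in [0, 1).

    The value depends only on (seed, index), so every vector sketched with
    the same seed sees the same value for a given index (the coordination).
    """
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Sketch a sparse vector given as {index: value}.

    Assumption: index i is kept when u_i <= m * v_i^2 / ||v||^2, so the
    expected number of kept entries is at most m. One pass over the
    entries after computing the squared norm, i.e. linear time.
    """
    sq_norm = sum(v * v for v in vec.values())
    samples = {}
    if sq_norm > 0:
        for i, v in vec.items():
            if v != 0 and shared_uniform(i, seed) <= m * v * v / sq_norm:
                samples[i] = v
    return {"m": m, "sq_norm": sq_norm, "samples": samples}


def estimate_inner_product(sa, sb):
    """Estimate <a, b> from two sketches built with the same seed.

    Each index present in both sketches contributes a_i * b_i divided by
    the probability that it was kept in both sketches.
    """
    total = 0.0
    for i, a_i in sa["samples"].items():
        b_i = sb["samples"].get(i)
        if b_i is None:
            continue
        p_both = min(
            1.0,
            sa["m"] * a_i * a_i / sa["sq_norm"],
            sb["m"] * b_i * b_i / sb["sq_norm"],
        )
        total += a_i * b_i / p_both
    return total


if __name__ == "__main__":
    a = {0: 1.0, 1: 2.0, 2: 3.0}
    b = {1: 4.0, 2: 5.0, 3: 6.0}
    sa = threshold_sketch(a, m=2, seed=7)
    sb = threshold_sketch(b, m=2, seed=7)
    print("estimate:", estimate_inner_product(sa, sb), "exact:", 23.0)
```

Because both sketches reuse the same per-index random value, an index with large entries in both vectors is likely to survive in both, which is what lets the estimator work from small samples. When `m` is large enough that every inclusion probability reaches 1, the sketch keeps every nonzero entry and the estimate equals the exact inner product.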