Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. Despite decades of literature on such sampling methods, this observation seems to have been overlooked. We further develop the finding by presenting and analyzing two alternative sampling-based inner product sketching methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods compute the sketch in linear time and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. Although our methods are based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove their approximation guarantees.
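To make the idea of coordinated sampling concrete, the following is a minimal Python sketch of a threshold-sampling inner product estimator. It is an illustration under stated assumptions, not the construction analyzed in the paper: the inclusion probability min(1, m·a_i²/‖a‖²), the SHA-256-based shared hash, and the names `shared_uniform`, `threshold_sketch`, and `estimate_inner_product` are all choices made here for the example. Vectors are represented as sparse dictionaries mapping index to value, and `m` is the target expected sketch size.

```python
import hashlib


def shared_uniform(index, seed=0):
    """Map an index to a pseudo-random value in [0, 1).

    Every sketch built with the same seed sees the same value for the
    same index, which is what coordinates the samples.
    """
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Sketch a sparse vector (dict: index -> value) in one linear pass.

    Assumed rule: keep index i when its shared hash value falls below
    m * v_i^2 / ||v||^2, so the expected number of samples is at most m.
    Returns the sampled entries together with the squared norm.
    """
    sq_norm = sum(v * v for v in vec.values())
    if sq_norm == 0:
        return {}, 0.0
    samples = {
        i: v
        for i, v in vec.items()
        if shared_uniform(i, seed) < m * v * v / sq_norm
    }
    return samples, sq_norm


def estimate_inner_product(sketch_a, sketch_b, m):
    """Estimate <a, b> from two sketches built with the same m and seed."""
    samples_a, sq_norm_a = sketch_a
    samples_b, sq_norm_b = sketch_b
    total = 0.0
    for i, a_i in samples_a.items():
        if i in samples_b:
            b_i = samples_b[i]
            # With a shared hash, i lands in both sketches with probability
            # min(1, p_a(i), p_b(i)); dividing by it reweights the sample.
            p_both = min(1.0, m * a_i * a_i / sq_norm_a, m * b_i * b_i / sq_norm_b)
            total += a_i * b_i / p_both
    return total


if __name__ == "__main__":
    a = {0: 1.0, 1: 2.0, 2: 3.0}
    b = {1: 4.0, 2: 5.0, 3: 6.0}
    m = 2
    estimate = estimate_inner_product(threshold_sketch(a, m), threshold_sketch(b, m), m)
    print("estimate:", estimate, "exact:", 2.0 * 4.0 + 3.0 * 5.0)
```

Because both sketches hash each index with the same function, an index that is heavy in both vectors tends to be sampled in both, which is the property that independent sampling would lack. When `m` is large enough that every inclusion probability reaches 1, the sketch stores the whole vector and the estimate equals the exact inner product.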