Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods compute the sketch in linear time and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While our methods are based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for them.
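To make the idea of a coordinated, sampling-based sketch concrete, the following is a minimal illustration of threshold sampling for inner product estimation. It is a sketch under stated assumptions, not the authors' implementation: the inclusion probability min(1, m·a_i²/‖a‖²), the inverse-probability estimator, and all names (`uniform_hash`, `threshold_sketch`, `estimate_inner_product`, the parameter `m`, the `seed`) are assumptions chosen for illustration and are not specified in the abstract. Vectors are represented as sparse dictionaries from index to value.

```python
import hashlib


def uniform_hash(index, seed):
    """Map (seed, index) to a pseudo-uniform value in [0, 1).

    Every vector is sketched with the same hash, which is what makes the
    samples "coordinated" (assumed construction, for illustration only).
    """
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Sketch a sparse vector (dict index -> value) in one linear pass.

    Index i is kept when its shared hash value falls below
    p_i = min(1, m * v_i^2 / ||v||^2), so the expected sketch size is at most m.
    Returns the kept entries and the squared Euclidean norm.
    """
    sq_norm = sum(v * v for v in vec.values())
    samples = {}
    for i, v in vec.items():
        if v == 0:
            continue  # zero entries never contribute to an inner product
        p = min(1.0, m * v * v / sq_norm)
        if uniform_hash(i, seed) <= p:
            samples[i] = v
    return samples, sq_norm


def estimate_inner_product(sketch_a, sketch_b, m):
    """Inverse-probability estimate of <a, b> from two threshold sketches.

    With a shared hash, index i appears in both sketches with probability
    min(p_i(a), p_i(b)), so each surviving product is divided by that value.
    """
    samples_a, sq_norm_a = sketch_a
    samples_b, sq_norm_b = sketch_b
    estimate = 0.0
    for i, a_i in samples_a.items():
        if i in samples_b:
            b_i = samples_b[i]
            p = min(1.0, m * a_i * a_i / sq_norm_a, m * b_i * b_i / sq_norm_b)
            estimate += a_i * b_i / p
    return estimate


a = {0: 1.0, 1: 2.0, 3: -1.0}
b = {1: 0.5, 3: 4.0, 5: 2.0}
m = 1000  # large enough that every nonzero entry is kept with probability 1
exact_estimate = estimate_inner_product(
    threshold_sketch(a, m), threshold_sketch(b, m), m
)
```

In this toy usage the target size `m` is deliberately larger than needed, so both sketches keep every nonzero entry and the estimate coincides with the exact inner product. With a smaller `m`, only a random, coordinated subset of indices is retained and the result becomes a randomized estimate.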