Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods compute the sketch in linear time and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While our methods are based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for them.
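To make the idea of a sampling-based inner product sketch concrete, the following is a minimal illustration of threshold sampling with a shared hash function. It is a sketch under stated assumptions, not necessarily the exact algorithm analyzed in the paper: the function names (`shared_uniform`, `threshold_sketch`, `estimate_inner_product`), the dictionary representation of sparse vectors, the SHA-256 based hash, and the choice of inclusion probability min(1, m·a_i²/‖a‖²) are all assumptions made for illustration. Because both vectors use the same hash value per index, an index lands in both sketches with probability equal to the smaller of its two inclusion probabilities, which is what the inverse-probability weighting in the estimator corrects for.

```python
import hashlib


def shared_uniform(index, seed=0):
    """Map an index to a pseudo-random value in [0, 1).

    The value depends only on (seed, index), so every vector sketched with
    the same seed sees the same value. This is the coordination step.
    """
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def threshold_sketch(vec, m, seed=0):
    """Sketch a sparse vector given as {index: value}.

    Index i is kept when its shared hash value is at most m * v_i^2 / ||v||^2,
    so the expected number of kept entries is at most m. One pass over the
    vector computes the norm and a second pass selects the samples.
    Returns (kept entries, squared norm).
    """
    sq_norm = sum(v * v for v in vec.values())
    if sq_norm == 0:
        return {}, 0.0
    samples = {
        i: v
        for i, v in vec.items()
        if shared_uniform(i, seed) <= m * v * v / sq_norm
    }
    return samples, sq_norm


def estimate_inner_product(sketch_a, sketch_b, m):
    """Estimate <a, b> from two sketches built with the same m and seed."""
    samples_a, sq_norm_a = sketch_a
    samples_b, sq_norm_b = sketch_b
    estimate = 0.0
    for i, a_i in samples_a.items():
        if i in samples_b:
            b_i = samples_b[i]
            # Probability that index i was kept in both sketches.
            p = min(1.0, m * a_i * a_i / sq_norm_a, m * b_i * b_i / sq_norm_b)
            estimate += a_i * b_i / p
    return estimate
```

With a small target size `m`, each sketch keeps only a handful of entries, biased toward those with large magnitude, and the estimate is unbiased but noisy. With `m` large enough that every inclusion probability reaches 1, the sketches keep every nonzero entry and the estimate equals the exact inner product.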