We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.
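To make the linear sketching baseline concrete, the following is a minimal sketch of Johnson-Lindenstrauss-style inner product estimation on sparse vectors. It is not the Weighted MinHash method proposed here, only an illustration of the kind of baseline it is compared against. The function names `jl_sketch` and `estimate_inner_product`, the dictionary representation of sparse vectors, and the per-index seeding scheme are all assumptions made for this example.

```python
import math
import random


def jl_sketch(vec, m, seed=0):
    """Project a sparse vector {index: value} to m dimensions.

    Uses an implicit Gaussian random matrix with N(0, 1/m) entries.
    Column `idx` is regenerated from a seed derived from (seed, idx),
    so every vector sketched with the same seed sees the same matrix.
    """
    sketch = [0.0] * m
    for idx, val in vec.items():
        rng = random.Random(seed * 1_000_003 + idx)  # hypothetical seeding scheme
        for row in range(m):
            sketch[row] += rng.gauss(0.0, 1.0) * val
    scale = 1.0 / math.sqrt(m)
    return [s * scale for s in sketch]


def estimate_inner_product(sketch_a, sketch_b):
    """Unbiased estimate of <a, b> from two sketches built with the same seed."""
    return sum(x * y for x, y in zip(sketch_a, sketch_b))


# Two sparse vectors whose non-zero entries overlap on only 10 of 50 indices.
a = {i: 1.0 for i in range(50)}
b = {i: 1.0 for i in range(40, 90)}
exact = sum(v * b[i] for i, v in a.items() if i in b)

est = estimate_inner_product(jl_sketch(a, 1000), jl_sketch(b, 1000))
print("exact:", exact, "estimate:", est)
```

For a Gaussian projection of this kind, the variance of the estimate is (‖a‖²‖b‖² + ⟨a, b⟩²)/m, so the error is governed by the product of the full vector norms even when only a few non-zero entries overlap. That is the regime in which the abstract claims an improvement for sparse vectors.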