In their seminal work, Broder \textit{et al.}~\citep{BroderCFM98} introduced the $\mathrm{minHash}$ algorithm, which computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, $\mathrm{minHash}$ has been widely used by practitioners in big data applications. In many real-life scenarios, however, the data is dynamic and its feature set evolves over time. We consider the setting in which features are dynamically inserted into and deleted from the dataset. A naive solution is to recompute $\mathrm{minHash}$ from scratch with respect to the updated dimension after every update. This is expensive, as it requires generating fresh random permutations. To the best of our knowledge, there has been no systematic study of $\mathrm{minHash}$ under dynamic insertion and deletion of features. In this work, we initiate this study and propose algorithms that adapt $\mathrm{minHash}$ sketches to the dynamic insertion and deletion of features. We give a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets. Empirically, we observe a significant speed-up in running time while retaining performance comparable to running $\mathrm{minHash}$ from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
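
For concreteness, the standard $\mathrm{minHash}$ guarantee can be stated as follows. The notation here ($A$, $B$, $\pi$, $d$, $k$, $h_\pi$) is introduced only for illustration and is an assumption, not notation fixed by the text above. Identify each binary vector in $\{0,1\}^d$ with the set of its nonzero coordinates, let $A, B \subseteq \{1,\ldots,d\}$ be two nonempty such sets, and let $\pi$ be a uniformly random permutation of $\{1,\ldots,d\}$. Writing $h_\pi(A) = \min_{i \in A} \pi(i)$, we have
\begin{equation*}
\Pr_{\pi}\bigl[h_\pi(A) = h_\pi(B)\bigr] \;=\; \frac{|A \cap B|}{|A \cup B|},
\end{equation*}
which is the Jaccard similarity of $A$ and $B$. A sketch built from $k$ independent permutations $\pi_1,\ldots,\pi_k$ therefore yields the unbiased estimate
\begin{equation*}
\frac{1}{k}\sum_{j=1}^{k} \mathbf{1}\bigl[h_{\pi_j}(A) = h_{\pi_j}(B)\bigr].
\end{equation*}
Because each $\pi_j$ is a permutation of all $d$ coordinates, a change in $d$ invalidates the permutations, which is why recomputation from scratch needs fresh ones.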