Set similarity metrics are central to several data mining tasks. To remove duplicate results from a Web search, for example, a common approach computes the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of link prediction. However, as the amount of data to be processed grows, calculating the exact similarity between all pairs can become intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are used in applications such as document deduplication and recommender systems, where large volumes of data need to be processed. Given the importance of these tasks, there is a clear demand for better estimators. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimation errors, and analyze the empirical performance of DotHash. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and duplicate document detection, at the same complexity and with similar comparison time.
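To make the quantities named in the abstract concrete, the following Python sketch computes the exact Jaccard and Adamic-Adar indices and shows how an estimate of the intersection size can be turned into a Jaccard estimate through the identity |A ∪ B| = |A| + |B| - |A ∩ B|. It also includes one simple way to build an unbiased intersection-size estimator from sums of random sign vectors compared with a dot product. This construction is an assumption made for illustration, since the abstract does not describe how DotHash works internally. All function names, the sketch dimension `dim`, and the `seed` parameter are hypothetical.

```python
import math
import random


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def jaccard_from_intersection(size_a: int, size_b: int, intersection: float) -> float:
    """Jaccard index from set sizes and an (exact or estimated) intersection size."""
    union = size_a + size_b - intersection
    return intersection / union if union else 0.0


def adamic_adar(neighbors: dict, u, v) -> float:
    """Exact Adamic-Adar index: sum of 1 / log(deg(w)) over common neighbors w."""
    return sum(
        1.0 / math.log(len(neighbors[w]))
        for w in neighbors[u] & neighbors[v]
        if len(neighbors[w]) > 1  # log(1) = 0 would divide by zero
    )


def sign_vector(element, dim: int, seed: int = 0) -> list:
    """Deterministic pseudo-random vector of +1/-1 entries for one element."""
    rng = random.Random(f"{seed}:{element!r}")
    return [1.0 if rng.getrandbits(1) else -1.0 for _ in range(dim)]


def sketch(items, dim: int, seed: int = 0) -> list:
    """Illustrative set sketch (assumption): the sum of the elements' sign vectors."""
    total = [0.0] * dim
    for item in items:
        for i, s in enumerate(sign_vector(item, dim, seed)):
            total[i] += s
    return total


def estimate_intersection(sketch_a: list, sketch_b: list) -> float:
    """Normalized dot product of two sketches.

    A shared element contributes exactly 1; distinct elements contribute
    0 in expectation, so the estimate of |A ∩ B| is unbiased.
    """
    return sum(x * y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)


if __name__ == "__main__":
    a, b = set(range(0, 20)), set(range(10, 30))
    est = estimate_intersection(sketch(a, 4096), sketch(b, 4096))
    print("exact Jaccard:    ", jaccard(a, b))
    print("estimated Jaccard:", jaccard_from_intersection(len(a), len(b), est))
```

In this sketch, a larger `dim` lowers the variance of the estimate at the cost of more memory and comparison time per pair. Comparing two sketches costs one dot product regardless of the set sizes.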