Identifying document similarity has many applications, e.g., source code analysis and plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein distance is a common metric for the similarity between two documents, but its quadratic runtime makes it impractical for large documents, where "large" begins at a few hundred kilobytes. In this paper, we present a novel concept for estimating the Levenshtein distance: the algorithm first compresses documents into signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score that allows examiners to set a threshold and identify related documents.
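To make the quadratic cost of the exact metric concrete, the following is a minimal sketch of the textbook dynamic-programming computation of the Levenshtein distance. It is the baseline that an estimate would be compared against, not the signature-based method proposed in this paper; the function name `levenshtein` and the two-row memory layout are illustrative choices and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via the classic dynamic program.

    Runs in O(len(a) * len(b)) time and O(min(len(a), len(b))) memory.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The nested loops visit every pair of positions, so two documents of 100,000 characters each require about 10^10 cell updates, which illustrates why the exact computation becomes impractical at a few hundred kilobytes.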