Adaptations of two features commonly applied in the field of visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for computing the similarity of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information: they are purely statistical and can be used in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
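To make the idea of carrying COM and RLM over from images to strings more concrete, the following is a minimal Python sketch. It is not the authors' formulation, which the abstract does not spell out. It assumes a direct transposition of the image-texture definitions: the co-occurrence feature counts ordered character pairs at a fixed offset, and the run-length feature counts maximal runs of a repeated character. The function names (`cooccurrence_counts`, `run_length_counts`, `cosine_similarity`, `com_similarity`), the `offset` parameter and the choice of cosine similarity for comparing two strings are all illustrative assumptions.

```python
from collections import Counter
from itertools import groupby
from math import sqrt


def cooccurrence_counts(s, offset=1):
    """Count ordered character pairs (s[i], s[i + offset]).

    Assumed string analogue of an image co-occurrence matrix,
    stored sparsely as a Counter instead of a dense matrix.
    """
    return Counter(zip(s, s[offset:]))


def run_length_counts(s):
    """Count maximal runs, keyed by (character, run length).

    Assumed string analogue of an image run-length matrix.
    """
    return Counter((ch, len(list(run))) for ch, run in groupby(s))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one string produced no features
    return dot / (norm_a * norm_b)


def com_similarity(s, t, offset=1):
    """Compare two strings through their co-occurrence counts (illustrative)."""
    return cosine_similarity(cooccurrence_counts(s, offset),
                             cooccurrence_counts(t, offset))
```

Because both features are plain counts over characters, the sketch uses no dictionary, tokenizer or grammar, which is consistent with the language-independence claimed above. A dense matrix over a fixed alphabet would work as well; the sparse `Counter` form simply avoids fixing the alphabet in advance.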