We present SemSimp, a parametric method for measuring the semantic similarity of digital resources. SemSimp is based on the notion of information content and leverages a reference ontology and taxonomic reasoning, supporting different approaches for weighting the concepts of the ontology. In particular, weights can be computed from either the available digital resources or the structure of the reference ontology of a given domain. We assess SemSimp against six representative semantic similarity methods from the literature for comparing sets of concepts, through an experiment that includes both a statistical analysis and an expert judgement evaluation. To achieve a reliable assessment, we used a large real-world dataset based on the Digital Library of the Association for Computing Machinery (ACM) and a reference ontology derived from the ACM Computing Classification System (ACM-CCS). For each method, we considered two indicators: the first is the degree of confidence in identifying the similarity among papers belonging to selected special issues of the ACM Transactions on Information Systems journal, and the second is the Pearson correlation with human judgement. The results reveal that one of the configurations of SemSimp outperforms the other assessed methods. An additional experiment in the domain of physics shows that, in general, SemSimp provides better results than the other similarity methods.
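To make the idea of frequency-based concept weighting concrete, the following is a minimal sketch of a generic information-content computation over a taxonomy. It is not the SemSimp formulation, which the abstract does not specify and which compares sets of concepts. The toy taxonomy, the annotation counts, and all function names are hypothetical, and Lin's measure between two single concepts is used only as a familiar stand-in for an information-content-based similarity.

```python
import math

# Hypothetical toy taxonomy (child -> parent); this is NOT the ACM-CCS.
PARENT = {
    "InformationSystems": "Root",
    "Retrieval": "InformationSystems",
    "Databases": "InformationSystems",
    "Theory": "Root",
}

# Hypothetical annotation counts: resources tagged with each concept.
COUNTS = {"Retrieval": 6, "Databases": 3, "Theory": 1}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def frequency_weights(counts):
    """Propagate counts upward so a concept also covers its descendants,
    then normalise by the total number of annotations."""
    total = sum(counts.values())
    freq = {}
    for concept, n in counts.items():
        for a in ancestors(concept):
            freq[a] = freq.get(a, 0) + n
    return {c: n / total for c, n in freq.items()}


def information_content(weights):
    """IC(c) = -log w(c): rarer concepts carry more information."""
    return {c: -math.log(w) for c, w in weights.items()}


def lin_similarity(a, b, ic):
    """Lin's measure: 2 * IC(lca) / (IC(a) + IC(b))."""
    anc_b = set(ancestors(b))
    # First ancestor of a that is also an ancestor of b (lowest common one).
    lca = next(x for x in ancestors(a) if x in anc_b)
    denom = ic[a] + ic[b]
    return 2 * ic[lca] / denom if denom > 0 else 1.0


weights = frequency_weights(COUNTS)
ic = information_content(weights)
sim_siblings = lin_similarity("Retrieval", "Databases", ic)
sim_distant = lin_similarity("Retrieval", "Theory", ic)
```

In this sketch the two sibling concepts under `InformationSystems` score higher than the pair whose only common ancestor is the root, since the root has zero information content. A structure-based weighting, the other option mentioned in the abstract, would replace `frequency_weights` with a function of the taxonomy alone.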