Measuring the relatedness between scientific publications is essential in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness and are widely used in combination with Salton's cosine similarity. The latter is problematic because it only considers exact matches between terms. This article introduces two alternative methods, soft cosine and maximum term similarities, that account for the semantic similarity between non-matching terms. The article compares the accuracy of all three methods using the assignment of publications to topics in the TREC 2006 Genomics Track and the assumption that accurate relatedness measures should assign high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics. Results show that soft cosine is the most accurate method, while the most widely used version of Salton's cosine is markedly less accurate than the other methods tested. These findings have implications for how controlled vocabularies should be used to measure relatedness.
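To make the contrast between exact-match and semantic matching concrete, the following is a minimal sketch of Salton's cosine and of soft cosine in its commonly cited form, which may differ from the exact variant the article evaluates. It assumes each publication is a dict mapping controlled-vocabulary terms to weights. The names `TERM_SIM` and `term_sim`, the example terms, and the 0.8 similarity value are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical term-to-term similarities; a real study would derive these
# from the structure of the controlled vocabulary.
TERM_SIM = {frozenset(("Neoplasms", "Carcinoma")): 0.8}


def term_sim(t1, t2):
    """Similarity between two terms: 1.0 if identical, else a lookup (default 0.0)."""
    if t1 == t2:
        return 1.0
    return TERM_SIM.get(frozenset((t1, t2)), 0.0)


def cosine(a, b):
    """Salton's cosine: only terms present in both publications contribute."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def soft_cosine(a, b, sim=term_sim):
    """Soft cosine: every term pair contributes, weighted by term similarity."""
    def soft_dot(x, y):
        return sum(wx * wy * sim(tx, ty)
                   for tx, wx in x.items()
                   for ty, wy in y.items())

    denom = math.sqrt(soft_dot(a, a)) * math.sqrt(soft_dot(b, b))
    return soft_dot(a, b) / denom if denom else 0.0


# Two toy publications sharing one exact term and one pair of related terms.
pub_a = {"Neoplasms": 1.0, "Mice": 1.0}
pub_b = {"Carcinoma": 1.0, "Mice": 1.0}

print(cosine(pub_a, pub_b))       # about 0.5: only "Mice" matches exactly
print(soft_cosine(pub_a, pub_b))  # about 0.9: the related pair also counts
```

In this toy case the related but non-identical terms raise the score from about 0.5 to about 0.9, which is the effect the abstract attributes to measures that account for semantic similarity between non-matching terms.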