Clustering of publication networks is an efficient way to obtain classifications of large collections of research publications. Such classifications can be used, for example, to detect research topics, normalize citation relations, or explore the publication output of a unit. Citation networks can be created using a variety of approaches. Best practices for obtaining classifications by clustering have been investigated, in particular the performance of different publication-publication relatedness measures. However, different approaches to normalizing citation relations have not been evaluated to the same extent. In this paper, we evaluate five approaches to normalization of direct citation relations with respect to clustering solution quality in four data sets. As a sixth approach, we evaluate direct citation relations without normalization. To assess the quality of clustering solutions, we use three measures. (1) We compare the clustering solution to the reference lists of a set of publications using the Adjusted Rand Index. (2) Using the Silhouette width measure, we quantify the extent to which publications have relations to clusters other than the one to which they have been assigned. (3) We propose a measure that captures publications that have probably been inaccurately assigned. The results clearly show that normalization is preferable to unnormalized direct citation relations. Furthermore, the results indicate that the fractional normalization approach, which can be considered the standard approach, causes inaccurate assignments. The geometric normalization approach performs similarly to the fractional approach with respect to the Adjusted Rand Index and Silhouette width but leads to fewer inaccurate assignments. We therefore believe that the geometric approach may be preferred over the fractional approach.
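To make the difference between fractional and geometric normalization concrete, the following is a minimal sketch on a toy network. The formulas are assumptions made for illustration, since the abstract does not define them: fractional normalization is taken to divide each relation by the total relations of the source publication, and geometric normalization is taken to divide each relation by the geometric mean of the totals of both publications. The function names and the toy matrix are hypothetical.

```python
from math import sqrt

def row_sums(c):
    """Total relation strength of each publication (assumes non-negative entries)."""
    return [sum(row) for row in c]

def fractional(c):
    """Assumed form: r_ij = c_ij / sum_k c_ik (generally asymmetric)."""
    sums = row_sums(c)
    return [[(v / s if s else 0.0) for v in row] for row, s in zip(c, sums)]

def geometric(c):
    """Assumed form: r_ij = c_ij / sqrt(sum_k c_ik * sum_k c_jk) (symmetric for symmetric c)."""
    sums = row_sums(c)
    n = len(c)
    return [
        [(c[i][j] / sqrt(sums[i] * sums[j]) if c[i][j] else 0.0) for j in range(n)]
        for i in range(n)
    ]

# Toy symmetric direct citation matrix: publication 0 is linked to 1, 2, and 3,
# which have no other relations (a star-shaped network).
toy = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

frac = fractional(toy)
geo = geometric(toy)
print("fractional 0->1:", frac[0][1], " 1->0:", frac[1][0])
print("geometric  0->1:", geo[0][1], " 1->0:", geo[1][0])
```

Under these assumed definitions, the fractional values differ by direction (the relation counts fully for the publication with a single link and only one third for the highly connected one), whereas the geometric value is the same in both directions. Whether this asymmetry explains the inaccurate assignments reported above is not something this sketch establishes.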