Clustering of publication networks is an efficient way to obtain classifications of large collections of research publications. Such classifications can be used to, e.g., detect research topics, normalize citation relations, or explore the publication output of a unit. Citation networks can be created using a variety of approaches. Best practices for obtaining classifications by clustering have been investigated, in particular the performance of different publication-publication relatedness measures. However, approaches to the normalization of citation relations have not been evaluated to the same extent. In this paper, we evaluate five approaches to normalization of direct citation relations with respect to clustering solution quality in four data sets. A sixth approach, which uses no normalization, is also evaluated. To assess the quality of clustering solutions, we use three measures. (1) We compare the clustering solution to the reference lists of a set of publications using the Adjusted Rand Index. (2) Using the Silhouette width measure, we quantify the extent to which publications have relations to clusters other than the one to which they have been assigned. (3) We propose a measure that captures publications that have probably been inaccurately assigned. The results clearly show that normalized direct citation relations are preferable to unnormalized ones. Furthermore, the results indicate that the fractional normalization approach, which can be considered the standard approach, causes inaccurate assignments. The geometric normalization approach performs similarly to the fractional approach with respect to the Adjusted Rand Index and Silhouette width but leads to fewer inaccurate assignments. We therefore believe that the geometric approach may be preferred over the fractional approach.
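As an illustration of measure (1), the Adjusted Rand Index between two partitions of the same publications can be computed from pair counts. The sketch below is a minimal, generic implementation of the standard index, not the evaluation pipeline used in the paper: how a reference partition is derived from reference lists is not specified here, and the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i in the
    two partitions being compared (hypothetical toy input, see below).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions agree trivially.
        return 1.0

    # Item pairs placed together in both partitions (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Item pairs placed together in each partition separately (marginals).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2

    if maximum == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example: the same grouping under different label names scores 1.0.
clustering = [0, 0, 1, 1]
reference = ["b", "b", "a", "a"]
print(adjusted_rand_index(clustering, reference))
```

The index is 1 for identical partitions, close to 0 for agreement at chance level, and negative when agreement is below chance.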