Clustering of publication networks is an efficient way to obtain classifications of large collections of research publications. Such classifications can be used to, e.g., detect research topics, normalize citation relations, or explore the publication output of a unit. Citation networks can be created using a variety of approaches. Best practices for obtaining classifications by clustering have been investigated, in particular the performance of different publication-publication relatedness measures. However, approaches to the normalization of citation relations have not been evaluated to the same extent. In this paper, we evaluate five approaches to normalization of direct citation relations with respect to clustering solution quality in four data sets. As a sixth approach, we evaluate direct citation relations without normalization. To assess the quality of clustering solutions, we use three measures. (1) We compare the clustering solution to the reference lists of a set of publications using the Adjusted Rand Index. (2) Using the Silhouette width measure, we quantify the extent to which publications have relations to clusters other than the one to which they have been assigned. (3) We propose a measure that captures publications that have probably been inaccurately assigned. The results clearly show that normalized direct citation relations are preferable to unnormalized ones. Furthermore, the results indicate that the fractional normalization approach, which can be considered the standard approach, causes inaccurate assignments. The geometric normalization approach performs similarly to the fractional approach with respect to the Adjusted Rand Index and Silhouette width but leads to fewer inaccurate assignments. We therefore believe that the geometric approach may be preferred over the fractional approach.
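For readers unfamiliar with the first quality measure, the following is a minimal sketch of how the Adjusted Rand Index compares two partitions of the same items. It is an illustration of the standard pair-counting formula only: the abstract does not describe how a reference partition is derived from reference lists, so the function name `adjusted_rand_index` and the toy label lists are assumptions made for this example, not the procedure used in the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Zero or one item: the partitions agree trivially.
        return 1.0

    # Contingency counts: items sharing a cluster in both labelings,
    # and cluster sizes in each labeling separately.
    joint = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    # Number of item pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in joint.values())
    together_a = sum(comb(c, 2) for c in sizes_a.values())
    together_b = sum(comb(c, 2) for c in sizes_b.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2

    if maximum == expected:
        # Degenerate case, e.g. every item in a single cluster in both labelings.
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Toy labelings (assumed data): cluster assignment per publication.
    clustering = [0, 0, 1, 1]
    reference = [1, 1, 0, 0]  # same grouping, different label names
    print(adjusted_rand_index(clustering, reference))
```

The index depends only on which items are grouped together, not on the label names, so the two toy labelings above count as identical partitions. Values near zero indicate agreement no better than chance, and negative values indicate agreement below chance.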