A multiple k-means cluster ensemble framework for clustering citation trajectories

Citation maturity time varies across articles, yet the impact of all articles is measured over a fixed window. Clustering citation trajectories helps in understanding the knowledge diffusion process and reveals that not all articles gain immediate success after publication. Moreover, clustering trajectories is necessary for paper impact recommendation algorithms. It is a challenging problem because citation time series exhibit significant variability due to their nonlinear and non-stationary characteristics. Prior works propose sets of arbitrary thresholds and fixed rule-based approaches, and all of these methods are primarily parameter-dependent. Consequently, they lead to inconsistencies in defining similar trajectories and to ambiguity about the number of distinct trajectories. Most studies capture only extreme trajectories. Thus, a generalised clustering framework is required. This paper proposes a feature-based multiple k-means cluster ensemble framework. A total of 195,783 and 41,732 well-cited articles from the Microsoft Academic Graph data are considered for clustering short-term (10-year) and long-term (30-year) trajectories, respectively. The framework has linear run time. Four distinct trajectories are obtained: Early Rise-Rapid Decline (2.2%), Early Rise-Slow Decline (45%), Delayed Rise-No Decline (53%), and Delayed Rise-Slow Decline (0.8%). Individual trajectory differences across the two time spans are studied. Most papers exhibit the Early Rise-Slow Decline and Delayed Rise-No Decline patterns. The growth and decay times, cumulative citation distribution, and peak characteristics of individual trajectories are redefined empirically. A detailed comparative study reveals that the proposed methodology can detect all distinct trajectory classes.
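To make the idea of a feature-based multiple k-means ensemble concrete, the following is a minimal sketch and not the paper's actual method. The two features (normalised time to peak and final-year level relative to the peak), the choice of k, the number of runs, and the co-association consensus step are all assumptions made for illustration. The pairwise co-association matrix used here is quadratic in the number of articles, so this sketch does not reproduce the linear run time reported above.

```python
import random


def extract_features(series):
    """Hypothetical features for one yearly citation series (assumed, not from the paper)."""
    peak_value = max(series) or 1  # avoid division by zero for uncited articles
    peak_index = max(range(len(series)), key=series.__getitem__)
    time_to_peak = peak_index / (len(series) - 1)  # 0 = peaks at once, 1 = peaks last
    final_level = series[-1] / peak_value          # 0 = fully decayed, 1 = no decline
    return (time_to_peak, final_level)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means with random initial centres; returns one label per point."""
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                  for p in points]
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep the old centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def cluster_ensemble(points, k, runs=20, seed=0):
    """Run k-means several times and merge the results by co-association.

    Two points end up in the same consensus cluster when a chain of pairs,
    each co-clustered in more than half of the runs, connects them.
    """
    rng = random.Random(seed)
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    # Connected components of the "co-clustered in a majority of runs" graph.
    consensus = [-1] * n
    next_label = 0
    for start in range(n):
        if consensus[start] != -1:
            continue
        consensus[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if consensus[j] == -1 and co[i][j] * 2 > runs:
                    consensus[j] = next_label
                    stack.append(j)
        next_label += 1
    return consensus


# Toy 10-year citation counts (invented for illustration only).
toy_series = [
    [10, 30, 20, 8, 3, 1, 0, 0, 0, 0],    # early rise, rapid decline
    [25, 20, 10, 5, 2, 1, 1, 0, 0, 1],    # early rise, rapid decline
    [0, 1, 1, 2, 4, 6, 9, 12, 15, 18],    # delayed rise, no decline
    [1, 0, 2, 3, 5, 8, 10, 14, 16, 15],   # delayed rise, almost no decline
]
toy_points = [extract_features(s) for s in toy_series]
toy_labels = cluster_ensemble(toy_points, k=2, runs=20, seed=0)
print(toy_labels)
```

On the toy data, the two early-rise series should share one consensus label and the two delayed-rise series another. A real application would need richer features and a consensus step that scales to hundreds of thousands of articles.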
