Citation graphs are fundamental tools for modeling the structure of science, but they are often fragmented because scientifically related articles do not always cite one another. To address this issue, we propose a computationally efficient hybrid framework that integrates citation topology with large language model (LLM)-based text similarity. Using 662,369 Web of Science publications in Mathematics and Operations Research & Management Science, we augment the original graph by adding semantic edges from small, disconnected components and by weighting existing citations according to textual similarity. Semantic augmentation substantially reduces fragmentation while preserving disciplinary homogeneity. Compared to embedding-only clustering, Leiden clustering on the augmented graphs retains structural interpretability while offering multi-scale organization. The method scales efficiently to large datasets and offers a practical strategy for strengthening citation-based indicators without collapsing disciplinary boundaries.
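The augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_small` and `min_sim` parameters, the rule of linking each node of a small component to its single most similar outside node, and the toy two-dimensional vectors standing in for LLM embeddings are all assumptions made for illustration. The sketch weights existing citation edges by cosine similarity and adds semantic edges originating from small disconnected components.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def connected_components(nodes, edges):
    """Connected components of an undirected graph, via depth-first search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps


def augment(embeddings, citations, max_small=2, min_sim=0.5):
    """Return (weighted citation edges, added semantic edges).

    Hypothetical rule: every node in a component of size <= max_small is
    linked to its most similar node outside that component, provided the
    similarity reaches min_sim.
    """
    nodes = list(embeddings)
    # Weight each existing citation by the textual similarity of its endpoints.
    weighted = {
        tuple(sorted((a, b))): cosine(embeddings[a], embeddings[b])
        for a, b in citations
    }
    semantic = {}
    for comp in connected_components(nodes, citations):
        if len(comp) > max_small:
            continue  # only small components receive semantic edges
        members = set(comp)
        for a in comp:
            candidates = [
                (cosine(embeddings[a], embeddings[b]), b)
                for b in nodes
                if b not in members
            ]
            if not candidates:
                continue
            sim, b = max(candidates, key=lambda t: t[0])
            if sim >= min_sim:
                semantic[tuple(sorted((a, b)))] = sim
    return weighted, semantic


# Toy data: four cited papers plus one isolated paper "E" (assumed values).
toy_embeddings = {
    "A": (1.0, 0.0),
    "B": (0.8, 0.6),
    "C": (0.0, 1.0),
    "D": (0.6, 0.8),
    "E": (1.0, 0.1),
}
toy_citations = [("A", "B"), ("B", "C"), ("C", "D")]

toy_weighted, toy_semantic = augment(toy_embeddings, toy_citations)
toy_all_edges = list(toy_weighted) + list(toy_semantic)
print(len(connected_components(list(toy_embeddings), toy_all_edges)))
```

In this toy case the isolated paper `E` is attached to `A`, its nearest neighbor in embedding space, so the augmented graph forms a single component. The combined weighted edge list is the kind of input on which a Leiden implementation could then be run; that clustering step is omitted here. The exhaustive similarity search is quadratic in the number of nodes and would need an approximate nearest-neighbor index at the scale reported above.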