Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents

The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction. Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching a mean accuracy of 0.923 and a mean macro-F1 of 0.923. Triple-only and knowledge-infused variants do not consistently outperform this baseline. In clustering, KMeans and GMM generally outperform HDBSCAN on external validity metrics, while HDBSCAN exhibits higher noise sensitivity. We observe that naively adding extracted triples does not guarantee gains and can reduce performance, depending on the representation chosen. These results refine the role of knowledge infusion in scientific document modeling: structured triples are informative but not universally beneficial, and their impact is strongly configuration-dependent. Our findings provide a reproducible benchmark and practical guidance on when knowledge-augmented representations help and when strong text-only baselines remain preferable.
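As a rough illustration of the inputs and the reporting protocol described above, the following sketch builds text for three of the four representations and averages a macro-F1 score over seeds 40-44. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the way triples are verbalized into sentences, and the simple concatenation used for abstract+triples are all hypothetical, and the hybrid representation is left out because its construction is not described here.

```python
from statistics import mean


def verbalize_triples(triples):
    """Turn (subject, predicate, object) tuples into one plain-text string.

    Assumption: each triple is rendered as a short sentence. The abstract
    does not say how triples are serialized before embedding.
    """
    return " ".join(f"{s} {p} {o}." for s, p, o in triples)


def build_representations(abstract, triples):
    """Return text inputs for three of the four representations.

    Assumption: abstract+triples is plain concatenation. The hybrid
    variant is omitted because its definition is not given.
    """
    triple_text = verbalize_triples(triples)
    return {
        "abstract": abstract,
        "triples": triple_text,
        "abstract+triples": f"{abstract} {triple_text}".strip(),
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_over_seeds(run_fn, seeds=range(40, 45)):
    """Average a per-seed metric over seeds 40-44.

    run_fn is a hypothetical callable: seed -> metric value.
    """
    return mean(run_fn(seed) for seed in seeds)
```

In a fuller pipeline, `run_fn` would embed the chosen representation, train a classifier with the given seed, and return `macro_f1` on held-out documents; those steps are omitted here because the abstract does not specify them.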
