Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents - 专知论文

会员服务 ·

0

三元 · 三元组 · 知识 · 嵌入 · 结构 ·

2025 年 12 月 19 日

Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents

翻译：基于三元组与知识注入嵌入的科学文献聚类与分类方法

The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we explore how structured knowledge, specifically, subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats that integrate both. Using a filtered arXiv corpus, we extract relational triples from abstracts and construct four text representations, which we embed using four state-of-the-art transformer models: MiniLM, MPNet, SciBERT, and SPECTER. We evaluate the resulting embeddings with KMeans, GMM, and HDBSCAN for unsupervised clustering, and fine-tune classification models for arXiv subject prediction. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance, reaching up to 92.6% accuracy and 0.925 macro-F1. We also find that lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) in clustering, while SciBERT excels in structured-input classification. These findings highlight the complementary benefits of combining unstructured text with structured knowledge, offering new insights into knowledge-infused representations for semantic organization of scientific documents.

翻译：科学文献数量与复杂性的日益增长，亟需稳健的方法来组织和理解研究文档。本研究探讨结构化知识——特别是主语-谓语-宾语三元组——如何提升科学论文的聚类与分类效果。我们提出一种模块化流程，该流程结合了无监督聚类与有监督分类方法，并针对多种文档表示形式进行分析：原始摘要、提取的三元组以及融合二者的混合形式。通过使用过滤后的arXiv语料库，我们从摘要中提取关系三元组，构建了四种文本表示形式，并采用四种前沿的Transformer模型进行嵌入：MiniLM、MPNet、SciBERT和SPECTER。我们使用KMeans、GMM和HDBSCAN对生成的嵌入进行无监督聚类评估，并通过微调分类模型进行arXiv学科类别预测。实验结果表明，完整摘要文本能产生最连贯的聚类簇，但融合三元组的混合表示形式能持续提升分类性能，最高达到92.6%的准确率与0.925的宏平均F1值。我们还发现，在聚类任务中，轻量级句子编码器（MiniLM、MPNet）优于领域专用模型（SciBERT、SPECTER），而在结构化输入分类任务中SciBERT表现最佳。这些发现凸显了非结构化文本与结构化知识结合的互补优势，为科学文献语义组织的知识注入表示方法提供了新的见解。

0

相关内容

基于深度学习的中文文本分类综述

基于深度学习的中文文本分类综述

专知会员服务

25+阅读 · 2024年5月9日

电子科大最新《深度聚类》全面综述，20页pdf涵盖260篇文献全面阐述深度聚类方法

电子科大最新《深度聚类》全面综述，20页pdf涵盖260篇文献全面阐述深度聚类方法

专知会员服务

109+阅读 · 2022年10月16日

浙江大学等最新《深度聚类》综述，，35页pdf涵盖246篇文献概述深度聚类体系挑战与未来方向

浙江大学等最新《深度聚类》综述，，35页pdf涵盖246篇文献概述深度聚类体系挑战与未来方向

专知会员服务

132+阅读 · 2022年6月20日

复旦大学邱锡鹏等《自然语言处理范式迁移综述》论文，详述7大NLP范式：分类、匹配、SeqLab, MRC, Seq2Seq等

专知会员服务

54+阅读 · 2021年9月29日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【Snapchat-谷歌-微软】最新《深度学习文本分类》2020综述论文大全，150+DL分类模型，42页pdf215篇参考文献

【Snapchat-谷歌-微软】最新《深度学习文本分类》2020综述论文大全，150+DL分类模型，42页pdf215篇参考文献

专知会员服务

84+阅读 · 2020年4月9日

【论文】知识图嵌入的群表示理论（Group Representation Theory for Knowledge Graph Embedding），俄亥俄州立大学| Chen Cai

【论文】知识图嵌入的群表示理论（Group Representation Theory for Knowledge Graph Embedding），俄亥俄州立大学| Chen Cai

专知会员服务

31+阅读 · 2019年12月30日

【ECML-PKDD 2019】可解释序列分类的背景知识注入（Background Knowledge Injection forInterpretable Sequence Classification）

【ECML-PKDD 2019】可解释序列分类的背景知识注入（Background Knowledge Injection forInterpretable Sequence Classification）

专知会员服务

15+阅读 · 2019年12月3日

【元学习 | 论文】元学习聚类，Meta-Learning to Cluster，哥伦比亚大学

【元学习 | 论文】元学习聚类，Meta-Learning to Cluster，哥伦比亚大学

专知会员服务

42+阅读 · 2019年11月21日

【资源】元学习论文分类列表推荐

【资源】元学习论文分类列表推荐

专知

19+阅读 · 2019年12月3日

【论文笔记】基于文本语料库中分类法学习的综述：问题、资源和最新进展

【论文笔记】基于文本语料库中分类法学习的综述：问题、资源和最新进展

专知

12+阅读 · 2019年10月13日

五年12篇顶会论文综述！一文读懂深度学习文本分类方法

五年12篇顶会论文综述！一文读懂深度学习文本分类方法

AI100

10+阅读 · 2019年6月5日

【资源推荐】元学习（meta-learning）相关文献资源大列表

【资源推荐】元学习（meta-learning）相关文献资源大列表

专知

25+阅读 · 2019年3月6日

【GitHub项目推荐】文本分类最好的几个深度学习方法 TensorFlow 实践

【GitHub项目推荐】文本分类最好的几个深度学习方法 TensorFlow 实践

专知

39+阅读 · 2018年11月27日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

文本分类又来了，用 Scikit-Learn 解决多类文本分类问题

文本分类又来了，用 Scikit-Learn 解决多类文本分类问题

AI研习社

14+阅读 · 2018年7月22日

基于深度学习的文本分类6大算法-原理、结构、论文、源码打包分享

基于深度学习的文本分类6大算法-原理、结构、论文、源码打包分享

深度学习与NLP

25+阅读 · 2018年7月18日

读书报告 | Deep Learning for Extreme Multi-label Text Classification

读书报告 | Deep Learning for Extreme Multi-label Text Classification

科技创新与创业

48+阅读 · 2018年1月10日

fastText、TextCNN、TextRNN…这套NLP文本分类深度学习方法库供你选择

fastText、TextCNN、TextRNN…这套NLP文本分类深度学习方法库供你选择

数据派THU

29+阅读 · 2017年8月2日

图文混合跨媒体知识单元的模糊分类方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

多标记文本数据流分类方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

模糊认知集群优化的聚类算法

国家自然科学基金

9+阅读 · 2015年12月31日

面向异构信息网络中实体归类的模糊聚类

国家自然科学基金

1+阅读 · 2015年12月31日

基于聚类分析的高性能包分类技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

谱聚类在多个网络模块识别中的推广及在生物网络中的应用

国家自然科学基金

1+阅读 · 2014年12月31日

面向地图综合的多尺度空间聚类理论与方法

国家自然科学基金

1+阅读 · 2014年12月31日

面向词汇功能的学术文本语义识别与知识图谱构建

国家自然科学基金

5+阅读 · 2014年12月31日

面向社会化媒体异构大数据的快速组合聚类研究

国家自然科学基金

1+阅读 · 2014年12月31日

时间序列数据挖掘中的聚类模型与算法研究

国家自然科学基金

14+阅读 · 2008年12月31日

Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Arxiv

0+阅读 · 2月18日

Improved Approximation Algorithms for Relational Clustering

Arxiv

0+阅读 · 2月17日

Sample-Efficient "Clustering and Conquer" Procedures for Parallel Large-Scale Ranking and Selection

Arxiv

0+阅读 · 2月13日

Modelling and Classifying the Components of a Literature Review

Arxiv

0+阅读 · 2月9日

Enhancing Academic Paper Recommendations Using Fine-Grained Knowledge Entities and Multifaceted Document Embeddings

Arxiv

0+阅读 · 1月27日

Large-Scale Multidimensional Knowledge Profiling of Scientific Literature

Arxiv

0+阅读 · 1月21日

Graphical model-based clustering of categorical data

Arxiv

0+阅读 · 1月21日

SASA: Semantic-Aware Contrastive Learning Framework with Separated Attention for Triple Classification

Arxiv

0+阅读 · 1月19日

Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering

Arxiv

0+阅读 · 1月14日

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Arxiv

10+阅读 · 2018年8月29日

VIP会员

文章信息

相关主题

最新内容

以色列军事技术对美国军力发展的持续性赋能

以色列军事技术对美国军力发展的持续性赋能

专知会员服务

5+阅读 · 今天8:46

战场之外的较量：美伊冲突中的认知战与心理博弈

战场之外的较量：美伊冲突中的认知战与心理博弈

专知会员服务

4+阅读 · 今天7:41

俄乌战争中乌克兰防空能力演变与见解（中文版）

俄乌战争中乌克兰防空能力演变与见解（中文版）

专知会员服务

2+阅读 · 今天7:22

《面向巡飞弹药系统的情境感知深度强化学习自主非线性机动控制》

《面向巡飞弹药系统的情境感知深度强化学习自主非线性机动控制》

专知会员服务

6+阅读 · 今天6:04

《深度强化学习在兵棋推演中的应用》40页报告

《深度强化学习在兵棋推演中的应用》40页报告

专知会员服务

8+阅读 · 今天5:37

《多域作战面临复杂现实》

《多域作战面临复杂现实》

专知会员服务

6+阅读 · 今天5:35

《印度的多域作战：条令与能力发展》报告

《印度的多域作战：条令与能力发展》报告

专知会员服务

2+阅读 · 今天5:24

《是“修复情报”还是修复部队？阿富汗反叛乱行动中的美军情报调整》400页

《是“修复情报”还是修复部队？阿富汗反叛乱行动中的美军情报调整》400页

专知会员服务

2+阅读 · 今天5:18

美军的算法化军备库：无人机优势计划（DDP）、复制者倡议（Replicator）与联合全域指挥控制（JADC2）如何重写战争规则

美军的算法化军备库：无人机优势计划（DDP）、复制者倡议（Replicator）与联合全域指挥控制（JADC2）如何重写战争规则

专知会员服务

2+阅读 · 今天3:25

（中文版）美空军部发布《空军部数据战略》与《人工智能战略》两份战略：旨在加速建立军事优势

（中文版）美空军部发布《空军部数据战略》与《人工智能战略》两份战略：旨在加速建立军事优势

专知会员服务

15+阅读 · 今天2:55

【斯坦福博士论文】语言模型的机械可解释性与控制

【斯坦福博士论文】语言模型的机械可解释性与控制

专知会员服务

3+阅读 · 4月23日

大语言模型智能体长期记忆安全性综述：迈向记忆主权

大语言模型智能体长期记忆安全性综述：迈向记忆主权

专知会员服务

4+阅读 · 4月23日

美军被摧毁的空战装备：伊朗战争如何重创美国空中力量

美军被摧毁的空战装备：伊朗战争如何重创美国空中力量

专知会员服务

4+阅读 · 4月23日

人工智能赋能无人机：俄乌战争（万字长文）

人工智能赋能无人机：俄乌战争（万字长文）

专知会员服务

7+阅读 · 4月23日

国外海军作战管理系统与作战训练系统

国外海军作战管理系统与作战训练系统

专知会员服务

3+阅读 · 4月23日

相关VIP内容

基于深度学习的中文文本分类综述

基于深度学习的中文文本分类综述

专知会员服务

25+阅读 · 2024年5月9日

电子科大最新《深度聚类》全面综述，20页pdf涵盖260篇文献全面阐述深度聚类方法

电子科大最新《深度聚类》全面综述，20页pdf涵盖260篇文献全面阐述深度聚类方法

专知会员服务

109+阅读 · 2022年10月16日

浙江大学等最新《深度聚类》综述，，35页pdf涵盖246篇文献概述深度聚类体系挑战与未来方向

浙江大学等最新《深度聚类》综述，，35页pdf涵盖246篇文献概述深度聚类体系挑战与未来方向

专知会员服务

132+阅读 · 2022年6月20日

复旦大学邱锡鹏等《自然语言处理范式迁移综述》论文，详述7大NLP范式：分类、匹配、SeqLab, MRC, Seq2Seq等

专知会员服务

54+阅读 · 2021年9月29日

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

复杂的序列数据分析：现有算法的系统文献综述，Complex Sequential Data Analysis: A Systematic Literature Review of Existing Algorithms

专知会员服务

27+阅读 · 2020年7月24日

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

【论文推荐】层次知识图谱，Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

专知会员服务

49+阅读 · 2020年5月26日

【Snapchat-谷歌-微软】最新《深度学习文本分类》2020综述论文大全，150+DL分类模型，42页pdf215篇参考文献

【Snapchat-谷歌-微软】最新《深度学习文本分类》2020综述论文大全，150+DL分类模型，42页pdf215篇参考文献

专知会员服务

84+阅读 · 2020年4月9日

【论文】知识图嵌入的群表示理论（Group Representation Theory for Knowledge Graph Embedding），俄亥俄州立大学| Chen Cai

【论文】知识图嵌入的群表示理论（Group Representation Theory for Knowledge Graph Embedding），俄亥俄州立大学| Chen Cai

专知会员服务

31+阅读 · 2019年12月30日

【ECML-PKDD 2019】可解释序列分类的背景知识注入（Background Knowledge Injection forInterpretable Sequence Classification）

【ECML-PKDD 2019】可解释序列分类的背景知识注入（Background Knowledge Injection forInterpretable Sequence Classification）

专知会员服务

15+阅读 · 2019年12月3日

【元学习 | 论文】元学习聚类，Meta-Learning to Cluster，哥伦比亚大学

【元学习 | 论文】元学习聚类，Meta-Learning to Cluster，哥伦比亚大学

专知会员服务

42+阅读 · 2019年11月21日

热门VIP内容

开通专知VIP会员享更多权益服务

战场之外的较量：美伊冲突中的认知战与心理博弈

《面向巡飞弹药系统的情境感知深度强化学习自主非线性机动控制》

以色列军事技术对美国军力发展的持续性赋能

俄乌战争中乌克兰防空能力演变与见解（中文版）

相关资讯

【资源】元学习论文分类列表推荐

【资源】元学习论文分类列表推荐

专知

19+阅读 · 2019年12月3日

【论文笔记】基于文本语料库中分类法学习的综述：问题、资源和最新进展

【论文笔记】基于文本语料库中分类法学习的综述：问题、资源和最新进展

专知

12+阅读 · 2019年10月13日

五年12篇顶会论文综述！一文读懂深度学习文本分类方法

五年12篇顶会论文综述！一文读懂深度学习文本分类方法

AI100

10+阅读 · 2019年6月5日

【资源推荐】元学习（meta-learning）相关文献资源大列表

【资源推荐】元学习（meta-learning）相关文献资源大列表

专知

25+阅读 · 2019年3月6日

【GitHub项目推荐】文本分类最好的几个深度学习方法 TensorFlow 实践

【GitHub项目推荐】文本分类最好的几个深度学习方法 TensorFlow 实践

专知

39+阅读 · 2018年11月27日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

文本分类又来了，用 Scikit-Learn 解决多类文本分类问题

文本分类又来了，用 Scikit-Learn 解决多类文本分类问题

AI研习社

14+阅读 · 2018年7月22日

基于深度学习的文本分类6大算法-原理、结构、论文、源码打包分享

基于深度学习的文本分类6大算法-原理、结构、论文、源码打包分享

深度学习与NLP

25+阅读 · 2018年7月18日

读书报告 | Deep Learning for Extreme Multi-label Text Classification

读书报告 | Deep Learning for Extreme Multi-label Text Classification

科技创新与创业

48+阅读 · 2018年1月10日

fastText、TextCNN、TextRNN…这套NLP文本分类深度学习方法库供你选择

fastText、TextCNN、TextRNN…这套NLP文本分类深度学习方法库供你选择

数据派THU

29+阅读 · 2017年8月2日

相关论文

Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Arxiv

0+阅读 · 2月18日

Improved Approximation Algorithms for Relational Clustering

Arxiv

0+阅读 · 2月17日

Sample-Efficient "Clustering and Conquer" Procedures for Parallel Large-Scale Ranking and Selection

Arxiv

0+阅读 · 2月13日

Modelling and Classifying the Components of a Literature Review

Arxiv

0+阅读 · 2月9日

Enhancing Academic Paper Recommendations Using Fine-Grained Knowledge Entities and Multifaceted Document Embeddings

Arxiv

0+阅读 · 1月27日

Large-Scale Multidimensional Knowledge Profiling of Scientific Literature

Arxiv

0+阅读 · 1月21日

Graphical model-based clustering of categorical data

Arxiv

0+阅读 · 1月21日

SASA: Semantic-Aware Contrastive Learning Framework with Separated Attention for Triple Classification

Arxiv

0+阅读 · 1月19日

Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering

Arxiv

0+阅读 · 1月14日

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Arxiv

10+阅读 · 2018年8月29日

相关基金

图文混合跨媒体知识单元的模糊分类方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

多标记文本数据流分类方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

模糊认知集群优化的聚类算法

国家自然科学基金

9+阅读 · 2015年12月31日

面向异构信息网络中实体归类的模糊聚类

国家自然科学基金

1+阅读 · 2015年12月31日

基于聚类分析的高性能包分类技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

谱聚类在多个网络模块识别中的推广及在生物网络中的应用

国家自然科学基金

1+阅读 · 2014年12月31日

面向地图综合的多尺度空间聚类理论与方法

国家自然科学基金

1+阅读 · 2014年12月31日

面向词汇功能的学术文本语义识别与知识图谱构建

国家自然科学基金

5+阅读 · 2014年12月31日

面向社会化媒体异构大数据的快速组合聚类研究

国家自然科学基金

1+阅读 · 2014年12月31日

时间序列数据挖掘中的聚类模型与算法研究

国家自然科学基金

14+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员