SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. Existing datasets aid model development, but most cover only specific sections of a publication because of domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP, a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 6,409 entities and 1,648 relations. SciNLP is the first dataset to provide full-text annotations of entities and their relations in the NLP domain. To validate its effectiveness, we conducted comparative experiments with similar datasets and evaluated state-of-the-art supervised models on it. The results show that the extraction capabilities of existing models vary across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we automatically constructed a fine-grained knowledge graph (KG) for the NLP domain. The KG has an average node degree of 3.3 per entity, indicating rich semantic topological information that can enhance downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.
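
As an illustration of the average node degree statistic reported above, the following minimal sketch computes it from a list of (head, relation, tail) triples. It assumes the common convention that each relation adds one to the degree of both entities it connects; the abstract does not say how the degree was computed. The function name, entity names, and relation labels are hypothetical and are not taken from the SciNLP annotation schema.

```python
from collections import defaultdict


def average_node_degree(triples):
    """Return the mean number of relation edges incident to each entity.

    Assumption: the graph is treated as undirected, so every triple adds
    one to the degree of its head and one to the degree of its tail.
    Entities that appear in no triple are not counted.
    """
    degree = defaultdict(int)
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    if not degree:
        return 0.0
    return sum(degree.values()) / len(degree)


# Hypothetical triples, for illustration only.
triples = [
    ("BERT", "used_for", "named entity recognition"),
    ("BERT", "evaluated_on", "CoNLL-2003"),
    ("named entity recognition", "evaluated_on", "CoNLL-2003"),
    ("BiLSTM-CRF", "used_for", "named entity recognition"),
]

print(average_node_degree(triples))  # 8 edge endpoints over 4 entities -> 2.0
```

Under this convention the statistic equals twice the number of relations divided by the number of connected entities, so it rises as extracted entities are linked by more relations.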

