Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. Existing datasets support model development, but because of domain complexity and the high cost of annotating scientific texts, most cover only specific sections of a publication. To address this limitation, we introduce SciNLP, a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 6,409 entities and 1,648 relations. SciNLP is the first dataset to provide full-text annotations of entities and their relations in the NLP domain. To validate its effectiveness, we ran comparative experiments against similar datasets and evaluated state-of-the-art supervised models on SciNLP. The results show that the extraction capabilities of existing models vary across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP yields significant performance improvements for certain baseline models. Using models trained on SciNLP, we automatically constructed a fine-grained knowledge graph (KG) for the NLP domain. The KG has an average node degree of 3.3 per entity, indicating rich semantic topological information that can enhance downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.
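The average node degree reported for the knowledge graph can be illustrated with a minimal sketch. It assumes the graph is stored as (head, relation, tail) triples and that a node's degree counts every relation edge touching it, incoming or outgoing; the paper may define the measure differently. The function name `average_node_degree` and the toy triples are hypothetical and are not taken from SciNLP.

```python
from collections import defaultdict


def average_node_degree(triples):
    """Mean number of relation edges touching each entity (incoming plus outgoing).

    Assumption: each (head, relation, tail) triple adds one to the degree
    of both its head and its tail.
    """
    degree = defaultdict(int)
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    if not degree:
        # An empty graph has no nodes, so report 0.0 instead of dividing by zero.
        return 0.0
    return sum(degree.values()) / len(degree)


# Hypothetical toy triples for illustration only.
toy_triples = [
    ("BERT", "used_for", "named entity recognition"),
    ("BERT", "evaluated_on", "CoNLL-2003"),
    ("BiLSTM-CRF", "used_for", "named entity recognition"),
    ("BiLSTM-CRF", "evaluated_on", "CoNLL-2003"),
]
toy_average = average_node_degree(toy_triples)
```

Under this definition each of the four toy entities touches two edges, so `toy_average` is 2.0. A higher value means that entities are, on average, linked to more neighbours.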