Named Entity Recognition (NER) models play a crucial role in many NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning (ML) models and datasets are a fundamental component of computer science publications, and identifying them reliably requires accurate models. Despite advances in NER, existing ground truth datasets do not treat fine-grained types such as ML model and model architecture as separate entity types, and consequently baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. To provide a nuanced understanding of how ML models and datasets are mentioned and used, our dataset also contains annotations for informal mentions such as "our BERT-based model" or "an image CNN". The ground truth dataset and the code to replicate model training are available at https://data.gesis.org/gsap/gsap-ner.