Recognizing dataset names is a critical task for automatic information extraction from scientific literature, enabling researchers to understand the research landscape and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of a main corpus of 31,219 scientific articles with over 449,000 dataset mentions weakly annotated as in-text spans, and an evaluation set of 450 manually annotated scientific articles. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge for developing novel dataset mention detection models.