Mentions of new concepts appear regularly in texts and require automated approaches to harvest them and place them into Knowledge Bases (KBs), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) they mostly assume that a new concept has already been discovered, and so cannot support out-of-KB mention discovery; (ii) they use only the concept label as input alongside the KB, and so lack the context in which the label appears; and (iii) they mostly focus on concept placement w.r.t. a taxonomy of atomic concepts rather than complex concepts, i.e., those built with logical operators. To address these issues, we propose a new benchmark that adapts the MedMentions dataset (PubMed abstracts) using the 2014 and 2017 versions of SNOMED CT, covering the Diseases sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We show how to use the dataset to evaluate out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods.