Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement

From arXiv: 5 pages, 1 figure; accepted at CIKM 2023. The dataset, data construction scripts, and baseline implementation are available at https://zenodo.org/record/8228005 (Zenodo) and https://github.com/KRR-Oxford/OET (GitHub).

Mentions of new concepts appear regularly in texts, and automated approaches are needed to harvest them and place them into Knowledge Bases (KBs), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) they mostly assume that a new concept is pre-discovered and so cannot support out-of-KB mention discovery; (ii) they use only the concept label as input along with the KB and thus lack the contexts of a concept label; and (iii) they mostly focus on concept placement w.r.t. a taxonomy of atomic concepts, instead of complex concepts, i.e., those with logical operators. To address these issues, we propose a new benchmark, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT versions from 2014 and 2017 under the Diseases sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We show how to use the dataset to evaluate out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods.
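
To make the out-of-KB setting concrete, here is a minimal Python sketch. It rests on an assumption that the abstract does not spell out: a mention is treated as out-of-KB when its gold concept is absent from the older KB version but present in the newer one. The `Mention` class, the `split_mentions` helper, and the concept IDs are hypothetical and purely illustrative; they are not taken from the released scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str        # surface form found in the abstract
    concept_id: str  # gold concept ID (toy, hypothetical IDs here)


def split_mentions(mentions, old_kb_ids, new_kb_ids):
    """Partition mentions into in-KB and out-of-KB w.r.t. the older KB.

    Assumption: a mention is out-of-KB if its concept is missing from
    the older version but present in the newer one.
    """
    in_kb, out_of_kb = [], []
    for m in mentions:
        if m.concept_id in old_kb_ids:
            in_kb.append(m)
        elif m.concept_id in new_kb_ids:
            out_of_kb.append(m)
        # Mentions whose concept is in neither version are skipped.
    return in_kb, out_of_kb


# Toy stand-ins for the concept ID sets of two KB versions.
old_kb_ids = {"C1", "C2"}
new_kb_ids = {"C1", "C2", "C3"}

mentions = [
    Mention("heart attack", "C1"),
    Mention("novel syndrome", "C3"),
    Mention("unlinked term", "C9"),
]

in_kb, out_of_kb = split_mentions(mentions, old_kb_ids, new_kb_ids)
```

Under this assumption, the out-of-KB mentions are the ones a discovery method would need to flag, and their concepts are the ones a placement method would need to insert into the older KB.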

