We investigate the task of inserting new concepts extracted from text into an ontology using language models. We explore a three-step approach: edge search, which finds a set of candidate insertion locations (i.e., subsumptions between concepts); edge formation and enrichment, which leverages the ontological structure to produce and enhance the edge candidates; and edge selection, which finally locates the edge for the new concept. In all steps we propose neural methods: for edge search, we apply embedding-based methods and contrastive learning with Pre-trained Language Models (PLMs) such as BERT; for edge selection, we adapt a multi-label Edge-Cross-encoder based on BERT fine-tuning, as well as Large Language Models (LLMs) such as the GPT series, FLAN-T5, and Llama 2. We evaluate the methods on recent datasets created from the SNOMED CT ontology and the MedMentions entity linking benchmark. The best settings in our framework use a fine-tuned PLM for search and a multi-label Cross-encoder for selection. Zero-shot prompting of LLMs is still inadequate for the task, and we propose explainable instruction tuning of LLMs for improved performance. Our study demonstrates the advantages of PLMs and highlights the encouraging performance of LLMs, motivating future studies.
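The edge-search step above can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's implementation: a bag-of-words counter stands in for a PLM encoder such as BERT, and `search_edges` ranks verbalised candidate edges by cosine similarity to the new concept, keeping the top-k as insertion candidates. The concept names and the `"subsumes"` verbalisation are illustrative assumptions.

```python
# Toy sketch of embedding-based edge search (hypothetical example;
# the paper uses PLM embeddings, e.g. BERT, not bag-of-words vectors).
import math
from collections import Counter

def embed(text):
    # Hypothetical stand-in for a PLM encoder: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_edges(new_concept, candidate_edges, top_k=2):
    # Rank candidate (parent, child) edges by similarity between the
    # new concept and a verbalisation of the edge; keep the top-k.
    q = embed(new_concept)
    scored = [(cosine(q, embed(f"{parent} subsumes {child}")), (parent, child))
              for parent, child in candidate_edges]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [edge for _, edge in scored[:top_k]]

edges = [("disorder of lung", "pneumonia"),
         ("disorder of heart", "myocarditis"),
         ("infectious disease", "viral pneumonia")]
print(search_edges("bacterial pneumonia", edges))
```

In the full approach, the retrieved candidate edges would then be enriched using the ontology's structure and passed to the edge-selection model (the Cross-encoder or an LLM) for the final decision.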