Biomedical entity linking is an essential part of biomedical natural language processing tasks such as text mining and question answering. However, linking biomedical entities with current large language models (LLMs) trained on a general corpus is difficult, because biomedical entities are sparsely distributed in text and have therefore rarely been seen by the LLM during training. In addition, these LLMs are not aware of the high-level semantic connections between different biomedical entities, which are useful for identifying similar concepts in different textual contexts. To address these problems, some recent works have focused on injecting knowledge graph (KG) information into LLMs. However, previous methods either ignore the relational knowledge of the entities or lead to catastrophic forgetting. Therefore, we propose a novel framework to pre-train a generative LLM on a corpus synthesized from a KG. In our evaluations, we are unable to confirm the benefit of including synonym, description, or relational information.