Biomedical triple extraction systems aim to automatically extract biomedical entities and the relations between them. Although current unified information extraction models achieve state-of-the-art performance, they struggle to understand relationships between entities within intricate biomedical sentences. Furthermore, the absence of a high-quality biomedical triple extraction dataset impedes progress in developing robust triple extraction systems. To tackle these challenges, we propose PeTailor, a novel retrieval-based framework for biomedical triple extraction. PeTailor explicitly retrieves the relevant document from our pre-built, diverse chunk database using a novel tailored chunk scorer, and integrates the retrieved information into the input of a Large Language Model (LLM) to generate the corresponding triple (head entity, relation, tail entity) for the input sentence. Additionally, we present GM-CIHT, an expert-annotated biomedical triple extraction dataset that covers a wider range of relation types. Experimental results show that PeTailor achieves state-of-the-art performance on GM-CIHT and two standard biomedical triple extraction datasets.
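
The retrieve-then-generate flow described above can be illustrated with a minimal sketch. This is not the PeTailor implementation: the token-overlap (Jaccard) scorer stands in for the tailored chunk scorer, which the abstract does not specify, and the chunk database, the function names (`score_chunk`, `retrieve`, `build_prompt`), and the prompt wording are all hypothetical. The sketch stops at prompt construction and makes no LLM call.

```python
import re

# Toy chunk database (illustrative sentences, not taken from the paper's database).
CHUNK_DB = [
    "Aspirin inhibits cyclooxygenase enzymes.",
    "Metformin is used to treat type 2 diabetes.",
    "BRCA1 mutations are associated with breast cancer.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_chunk(sentence, chunk):
    """Jaccard overlap between sentence and chunk tokens.

    Assumption: a simple stand-in for a learned, task-tailored chunk scorer.
    """
    a, b = tokenize(sentence), tokenize(chunk)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(sentence, chunk_db, k=1):
    """Return the k chunks with the highest score for the input sentence."""
    ranked = sorted(chunk_db, key=lambda c: score_chunk(sentence, c), reverse=True)
    return ranked[:k]


def build_prompt(sentence, retrieved):
    """Place the retrieved chunks in front of the sentence as LLM input.

    Assumption: the prompt wording is hypothetical.
    """
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        "Retrieved context:\n"
        f"{context}\n\n"
        f"Sentence: {sentence}\n"
        "Extract the triple as (head entity, relation, tail entity)."
    )


sentence = "Metformin lowers blood glucose in patients with type 2 diabetes."
top_chunks = retrieve(sentence, CHUNK_DB, k=1)
print(build_prompt(sentence, top_chunks))
```

In a full system the resulting prompt would be passed to an LLM, and the scorer would presumably be trained so that the retrieved chunks are the ones that actually help triple extraction rather than those that merely share surface tokens with the sentence.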