Document-level biomedical relation extraction (Bio-RE) aims to identify relations between biomedical entities across long texts and is a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations that span multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and do not integrate external knowledge, which limits contextual richness. The scarcity of annotated data further hampers model training. Recent advances in large language models (LLMs) have inspired us to address all of these issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework that combines LLM Adaptive Document-Relation Cross-Mapping (ADRCM) fine-tuning with Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, to address data scarcity, we introduce the Iteration-of-REsummary (IoRs) prompt, which generates Bio-RE task-specific synthetic data by guiding ChatGPT to focus on entity relations and iteratively refining the generated data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, at inference time, a biomedical-specific RAG approach, named CUI RAG, uses CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments on three Bio-RE datasets (GDA, CDR, and BioRED) show that the proposed method achieves state-of-the-art performance compared with related work.
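The idea of using CUIs as retrieval indexes can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy in-memory store of passages already annotated with CUIs, and every name in it (`build_cui_index`, `retrieve`, `build_prompt`, and the placeholder identifiers such as `CUI_A`) is hypothetical. It only shows how keying retrieval on entity identifiers, rather than on free text, narrows the candidate set to passages that mention the queried entity pair.

```python
from collections import defaultdict


def build_cui_index(passages):
    """Map each CUI to the ids of the passages that mention it (hypothetical helper)."""
    index = defaultdict(set)
    for pid, passage in enumerate(passages):
        for cui in passage["cuis"]:
            index[cui].add(pid)
    return index


def retrieve(index, passages, head_cui, tail_cui, k=2):
    """Return up to k passage texts, preferring passages that mention both entities."""
    head_hits = index.get(head_cui, set())
    tail_hits = index.get(tail_cui, set())
    both = head_hits & tail_hits              # passages covering the whole pair
    either = (head_hits | tail_hits) - both   # passages covering one entity only
    ranked = sorted(both) + sorted(either)
    return [passages[pid]["text"] for pid in ranked[:k]]


def build_prompt(document, head, tail, contexts):
    """Assemble an illustrative inference prompt with the retrieved contexts."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        f"Document:\n{document}\n\n"
        f"Additional context:\n{context_block}\n\n"
        f"What is the relation between {head} and {tail}?"
    )


# Toy data; the CUIs below are placeholders, not real UMLS identifiers.
passages = [
    {"text": "Passage about entity A only.", "cuis": {"CUI_A"}},
    {"text": "Passage linking entity A and entity B.", "cuis": {"CUI_A", "CUI_B"}},
    {"text": "Passage about entity C.", "cuis": {"CUI_C"}},
]
index = build_cui_index(passages)
contexts = retrieve(index, passages, "CUI_A", "CUI_B", k=2)
prompt = build_prompt("Input document text.", "entity A", "entity B", contexts)
```

In this sketch the passage about entity C is never considered, because neither queried CUI points to it. A real system would presumably add entity linking to obtain the CUIs and some relevance ranking within the narrowed set, neither of which is specified in the abstract.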