Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks

Information Extraction (IE) is crucial for converting unstructured data into structured formats such as Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. RE methods include supervised, unsupervised, weakly supervised, and rule-based approaches, and recent studies that leverage pre-trained language models (PLMs) have been notably successful in this area. In the current era of Large Language Models (LLMs), fine-tuning these models can overcome limitations of zero-shot, prompting-based RE methods, especially in domain adaptation and in identifying implicit relations between entities in a sentence. Implicit relations cannot be easily extracted from a sentence's dependency tree and require logical inference to identify accurately. This work explores the performance of fine-tuned LLMs and their integration into a Retrieval-Augmented Generation (RAG)-based RE approach, in which the LLM acts as the generator, to address the challenge of identifying implicit relations at the sentence level. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEval datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEval, where implicit relations are common, surpassing previous results on this dataset. Our method also outperforms previous work on TACRED, TACREV, and Re-TACRED.
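
As a rough illustration of how a RAG-style RE prompt might be assembled at the sentence level, the sketch below retrieves the most similar labeled training sentence and prepends it to the query before the prompt is handed to a generator. It is a minimal sketch under stated assumptions, not the method evaluated in this work: the bag-of-words cosine retriever, the prompt template, the three toy examples, and the names `TRAIN`, `retrieve`, and `build_prompt` are all hypothetical, and the call to a fine-tuned LLM is left out.

```python
import math
import re
from collections import Counter

# Hypothetical labeled examples standing in for a training set.
# Each entry: (sentence, entity 1, entity 2, relation label).
TRAIN = [
    ("The fire was caused by a faulty heater.", "fire", "heater", "Cause-Effect(e2,e1)"),
    ("The keys are in the drawer.", "keys", "drawer", "Content-Container(e1,e2)"),
    ("The author wrote the novel in 1990.", "author", "novel", "Product-Producer(e2,e1)"),
]


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(sentence, k=1):
    """Return the k training examples most similar to the query sentence.

    Assumption: a simple lexical retriever; a real system would more
    likely use dense sentence embeddings.
    """
    query = bag_of_words(sentence)
    ranked = sorted(TRAIN, key=lambda ex: cosine(query, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(sentence, head, tail, k=1):
    """Build a prompt that shows retrieved examples before the query."""
    parts = ["Identify the relation between the two entities."]
    for s, h, t, rel in retrieve(sentence, k):
        parts.append(f"Example sentence: {s}\nEntities: {h}, {t}\nRelation: {rel}")
    parts.append(f"Sentence: {sentence}\nEntities: {head}, {tail}\nRelation:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # The resulting string would be passed to the generator (for example,
    # a fine-tuned LLM), which completes the final "Relation:" line.
    print(build_prompt("The flood was caused by heavy rain.", "flood", "rain"))
```

The retrieved example gives the generator a labeled instance of a similar construction, which is the intuition behind using retrieval to support relations that are not stated explicitly in the query sentence.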
