Large language models (LLMs) usually fall short on information extraction (IE) tasks, struggling to follow their complex instructions. This primarily arises because LLMs are not aligned with humans on IE: mainstream alignment datasets typically do not include IE data. In this paper, we introduce ADELIE (Aligning large language moDELs on Information Extraction), an aligned LLM that effectively solves various IE tasks, including closed IE, open IE, and on-demand IE. We first collect and construct IEInstruct, a high-quality alignment corpus for IE. We then train ADELIE_SFT by instruction tuning on IEInstruct, and further train it with the direct preference optimization (DPO) objective to obtain ADELIE_DPO. Extensive experiments on various held-out IE datasets demonstrate that our models (ADELIE_SFT and ADELIE_DPO) achieve state-of-the-art (SoTA) performance among open-source models. We further explore the general capabilities of ADELIE, and experimental results reveal that these capabilities do not noticeably decline. We will release the code, data, and models to facilitate further research.