Traditional information extraction (IE) methods, constrained by pre-defined classes and static training paradigms, often adapt poorly to dynamic, real-world settings. To bridge this gap, we explore an instruction-based IE paradigm in this paper, leveraging the strong cross-task generalization capabilities of Large Language Models (LLMs). We observe that most existing IE datasets have highly redundant label sets, so instructions constructed from them include many labels that are not directly relevant to the content being extracted. To tackle this issue, we introduce InstructIE, a bilingual (Chinese and English) theme-centric IE instruction dataset that, for the first time, incorporates a theme scheme design to simplify the label structure. Furthermore, we develop a framework named KG2Instruction for the automatic generation of such datasets. Experimental evaluations on InstructIE show that while current models are promising on instruction-based IE tasks, there is still room for improvement. The dataset is available at https://huggingface.co/datasets/zjunlp/InstructIE.
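To make the theme-centric idea concrete, the following is a minimal sketch of how an instruction could be restricted to the labels of a single theme instead of a dataset-wide label set. The theme names, relation labels, prompt wording, and the `build_instruction` helper are all illustrative assumptions; they are not taken from InstructIE or KG2Instruction.

```python
import json

# Hypothetical theme -> relation-label mapping. These names are
# illustrative assumptions, not the actual InstructIE label set.
THEME_SCHEMA = {
    "Person": ["place of birth", "occupation", "spouse"],
    "Organization": ["founded by", "headquarters location"],
    "Geographic Location": ["located in", "capital of"],
}


def build_instruction(text: str, theme: str) -> str:
    """Build a JSON prompt that lists only the labels of one theme."""
    payload = {
        "instruction": (
            "Extract (head, relation, tail) triples from the input, "
            "using only the relations listed in the schema."
        ),
        "schema": THEME_SCHEMA[theme],  # labels for this theme only
        "input": text,
    }
    return json.dumps(payload, ensure_ascii=False)


prompt = build_instruction("Marie Curie was born in Warsaw.", "Person")
```

Because the schema is selected by theme, labels belonging to other themes (for example `founded by`) never appear in the prompt for a text about a person, which is the kind of label redundancy the abstract describes.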