Large language models perform well on general natural language tasks, but their effectiveness on information extraction remains suboptimal. Recent work attributes this mainly to the lack of extensive instruction data for information extraction. Existing information extraction instruction datasets have limited coverage and are costly to construct. To address this issue, we introduce InstructIE, a bilingual instruction-based information extraction dataset covering 12 diverse domains. Specifically, we propose KG2Instruction, a framework designed for the automatic generation of such datasets. Experimental results demonstrate that large language models trained on InstructIE not only achieve better information extraction capabilities but also improve zero-shot performance compared with baselines.