Large Language Models (LLMs) demonstrate remarkable potential across various domains, yet they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, while existing IE datasets tend to be small in scale, fragmented, and lacking in standardized schemas. To address this, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to derive a large-scale corpus. Experimental results on LLaMA, Baichuan, and Qwen demonstrate that training with IEPile enhances the performance of LLMs on IE, especially their zero-shot generalization. We open-source the corpus and pre-trained models, hoping to provide valuable support to the NLP community.
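To make the idea of schema-based instruction generation concrete, the sketch below shows one plausible way to turn a raw IE example into an instruction-tuning prompt by sampling a subset of schema labels. This is a minimal illustration, not the exact recipe used to build IEPile: the prompt template, field names, and the choice of sampling at most `k` labels per instruction are all assumptions for demonstration.

```python
# A minimal sketch of schema-based instruction generation (illustrative only;
# the template and JSON fields are assumptions, not the IEPile format).
import json
import random

def build_instruction(task: str, schema: list[str], text: str, k: int = 5) -> str:
    """Compose an IE instruction from a task description, a sampled subset of
    schema labels (entity/relation/event types), and the input text."""
    # Sample at most k labels per instruction, so prompts stay short and the
    # label set varies across generated examples.
    labels = random.sample(schema, min(k, len(schema)))
    return json.dumps({
        "instruction": f"Extract all {task} of the given types from the input.",
        "schema": labels,
        "input": text,
    }, ensure_ascii=False)

# Usage: turn one NER example into an instruction-tuning prompt.
ner_schema = ["person", "organization", "location", "date", "event"]
print(build_instruction("entities", ner_schema,
                        "Bill Gates founded Microsoft in 1975."))
```

Sampling schema labels per example, rather than always listing the full label inventory, is one way such a corpus can expose a model to many distinct instructions from the same underlying dataset.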