Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and we introduce schema-based instruction generation to unearth a large-scale corpus. Experimental results on LLaMA and Baichuan demonstrate that using IEPile enhances the performance of LLMs on IE, especially in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.