Large language models (LLMs) combined with instruction tuning have shown significant progress in information extraction (IE) tasks, generalizing strongly to unseen datasets by following annotation guidelines. However, their applicability to low-resource languages remains limited due to a lack of both labeled data for fine-tuning and unlabeled text for pre-training. In this paper, we propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, designed to close the performance gap between high- and low-resource languages. Our experiments across twelve multilingual IE datasets spanning 50 languages demonstrate that GoLLIE-TF achieves better zero-shot cross-lingual transfer than the base model. In addition, we show that TransFusion significantly improves low-resource language named entity recognition when applied to proprietary models such as GPT-4 (+5 F1) with a prompting approach, or when used to fine-tune different language models, including decoder-only (+14 F1) and encoder-only (+13 F1) architectures.
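To make the idea of predicting from a low-resource sentence together with its English translation more concrete, the following is a minimal sketch of one possible reading of that pipeline. It is an illustration under stated assumptions, not the TransFusion implementation: the names `Entity`, `build_fusion_input`, and `transfusion_predict`, the prompt wording, and the `translate` and `predict` callables are all hypothetical, and the fusion step itself is left to whatever model sits behind `predict`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Entity:
    """A predicted entity span with its label (hypothetical structure)."""
    text: str
    label: str


def build_fusion_input(source_text: str, english_translation: str) -> str:
    """Pair the original sentence with its English translation in one input.

    The prompt wording is an assumption made for illustration only.
    """
    return (
        f"Text: {source_text}\n"
        f"English translation: {english_translation}\n"
        "Annotate the English translation, then annotate the original text."
    )


def transfusion_predict(
    source_text: str,
    translate: Callable[[str], str],
    predict: Callable[[str], List[Entity]],
) -> List[Entity]:
    """Translate, build the paired input, and return the model's predictions.

    `translate` stands in for any machine translation system and `predict`
    for any IE model; both are placeholders supplied by the caller.
    """
    english_translation = translate(source_text)
    model_input = build_fusion_input(source_text, english_translation)
    return predict(model_input)
```

Passing the translation and the model as callables keeps the sketch independent of any particular translation system or LLM, so the same function could wrap either a prompted proprietary model or a fine-tuned one.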