Cross-lingual open information extraction (OIE) aims to extract structured information from raw text across multiple languages. Previous work uses a shared cross-lingual pre-trained model to handle the different languages but underuses the potential of language-specific representations. In this paper, we propose MT4CrossIE, a multi-stage tuning framework that enhances cross-lingual open information extraction by injecting language-specific knowledge into the shared model. Specifically, in the first stage the cross-lingual pre-trained model is tuned in a shared semantic space (e.g., the embedding matrix) while the encoder is kept fixed, and in the second stage the other components are optimized. After sufficient training, we freeze the pre-trained model and tune multiple additional low-rank language-specific modules using a mixture-of-LoRAs for model-based cross-lingual transfer. In addition, we leverage two-stage prompting to encourage a large language model (LLM) to annotate multilingual raw data for data-based cross-lingual transfer. The model is trained with multilingual objectives on our proposed dataset OpenIE4++ by combining the model-based and data-based transfer techniques. Experimental results on various benchmarks emphasize the importance of aggregating multiple plug-and-play language-specific modules and demonstrate the effectiveness of MT4CrossIE in cross-lingual OIE\footnote{\url{https://github.com/CSJianYang/Multilingual-Multimodal-NLP}}.
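To make the mixture-of-LoRAs idea concrete, the following is a minimal sketch of one linear layer with a frozen weight and several low-rank language-specific modules whose updates are aggregated by a gate. It is an illustration under stated assumptions, not the MT4CrossIE implementation: the class names `LoRAModule` and `MixtureOfLoRAs`, the softmax router, the zero initialization of `B`, and the `alpha / rank` scaling are all hypothetical choices made here, and the abstract does not specify how the modules are actually combined.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAModule:
    """One low-rank update, delta(x) = (alpha / rank) * B (A x)."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.scale = alpha / rank
        # A gets small random values; B starts at zero, so the module
        # initially contributes nothing (an assumed, common LoRA init).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class MixtureOfLoRAs:
    """Frozen weight W plus a gated sum of language-specific LoRA updates."""

    def __init__(self, frozen_weight, languages, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.W = frozen_weight  # never updated in this sketch
        self.languages = list(languages)
        self.modules = {
            lang: LoRAModule(d_in, d_out, rank, alpha, rng)
            for lang in self.languages
        }
        # Hypothetical router: one score vector per language module.
        self.router = [
            [rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in self.languages
        ]

    def gates(self, x):
        """Mixing weights over the language modules (they sum to 1)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for g, lang in zip(self.gates(x), self.languages):
            d = self.modules[lang].delta(x)
            y = [yi + g * di for yi, di in zip(y, d)]
        return y
```

In this sketch only the `A`, `B`, and router parameters would be trained, while `W` stays frozen, which mirrors the abstract's description of freezing the pre-trained model and tuning the extra low-rank modules. Because each module is a separate object keyed by language, one can be added or swapped without touching the shared weight, which is one way to read the "plug-and-play" property.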