Cross-lingual open information extraction (OIE) aims to extract structured information from raw text in multiple languages. Previous work handles the different languages with a single shared cross-lingual pre-trained model, which underuses the potential of language-specific representations. In this paper, we propose MT4CrossIE, an effective multi-stage tuning framework that enhances cross-lingual open information extraction by injecting language-specific knowledge into the shared model. Specifically, the cross-lingual pre-trained model is first tuned in a shared semantic space (e.g., the embedding matrix) with the encoder kept fixed, and the other components are then optimized in the second stage. After sufficient training, we freeze the pre-trained model and tune multiple additional low-rank language-specific modules with a mixture-of-LoRAs for model-based cross-lingual transfer. In addition, we use two-stage prompting to have a large language model (LLM) annotate multilingual raw data for data-based cross-lingual transfer. The model is trained with multilingual objectives on our proposed dataset OpenIE4++ by combining the model-based and data-based transfer techniques. Experimental results on various benchmarks highlight the importance of aggregating multiple plug-and-play language-specific modules and demonstrate the effectiveness of MT4CrossIE in cross-lingual OIE\footnote{\url{https://github.com/CSJianYang/Multilingual-Multimodal-NLP}}.
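The mixture-of-LoRAs step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name `MixtureOfLoRAs`, the softmax router over the input, the initialization scales, and all dimensions are assumptions made for illustration, since the abstract does not specify how the language-specific modules are weighted. The sketch only shows the general idea of a frozen weight whose output is corrected by a gated sum of low-rank updates.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MixtureOfLoRAs:
    """Frozen linear layer plus several low-rank adapters mixed by a gate.

    Hypothetical illustration: names, router, and shapes are assumptions.
    """

    def __init__(self, d_in, d_out, rank, num_adapters, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Stand-in for a pre-trained weight; it is never updated.
        self.frozen_weight = rand(d_out, d_in)
        # One (down, up) pair per language-specific module.
        self.down = [rand(rank, d_in) for _ in range(num_adapters)]
        # Up-projections start at zero, so the initial update is zero.
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_adapters)]
        # Assumed router: a linear map followed by softmax.
        self.gate = rand(num_adapters, d_in)

    def forward(self, x):
        """Return (output, gate weights) for one input vector."""
        out = matvec(self.frozen_weight, x)
        weights = softmax(matvec(self.gate, x))
        for g, down, up in zip(weights, self.down, self.up):
            delta = matvec(up, matvec(down, x))  # low-rank update B(Ax)
            out = [o + g * d for o, d in zip(out, delta)]
        return out, weights
```

As a usage example, `MixtureOfLoRAs(d_in=4, d_out=3, rank=2, num_adapters=3).forward([1.0, -0.5, 0.25, 2.0])` returns a 3-dimensional output together with three gate weights that sum to one. Because the up-projections start at zero, the layer initially reproduces the frozen model, and only the adapters and the gate would need gradients during tuning.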