Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods to underrepresented languages. We introduce a \emph{few-shot learning} approach that synthesises large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, \textsc{FsModQA}, significantly outperforms existing few-shot and supervised baselines on MLODQA as well as on cross-lingual and monolingual retrieval. We further show that our method can be extended to effective zero-shot adaptation to new languages through a \emph{cross-lingual prompting} strategy that uses only English-supervised data, making it a general and practical solution for MLODQA tasks without costly large-scale annotation.
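To make the cross-lingual prompting idea concrete, the following is a minimal, hypothetical sketch of how English-annotated exemplars could be assembled into a few-shot prompt that asks an LLM to generate a synthetic QA pair in a new target language. The function name, exemplar format, and prompt wording are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: build a few-shot prompt from English supervision only,
# asking an LLM to produce a question-answer pair in a target language.
# All names and the prompt template below are illustrative assumptions.

def build_crosslingual_prompt(english_exemplars, target_passage, target_language):
    """Assemble a few-shot prompt using English (passage, question, answer)
    exemplars, ending with a passage in the target language."""
    lines = [
        f"Generate a question and answer in {target_language} "
        "for the final passage, following the English examples."
    ]
    for ex in english_exemplars:
        lines.append(f"Passage: {ex['passage']}")
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Answer: {ex['answer']}")
    # The target-language passage is appended last; the LLM completes
    # the "Question:" / "Answer:" fields in that language.
    lines.append(f"Passage: {target_passage}")
    lines.append("Question:")
    return "\n".join(lines)

exemplars = [
    {"passage": "The Nile is a river in Africa.",
     "question": "On which continent is the Nile located?",
     "answer": "Africa"},
]
prompt = build_crosslingual_prompt(
    exemplars, "富士山は日本で最も高い山である。", "Japanese")
```

The resulting string would then be sent to an LLM of choice; the generated QA pairs form the synthetic training data described above.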