Multilingual Pretrained Language Models (MPLMs) have shown strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline, which improves zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) and used as prompts. PARC improves zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization, and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families, in both the unlabeled setting (+5.1%) and the labeled setting (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on the one hand and, on the other, both the similarity between the high- and low-resource languages and the amount of low-resource pretraining data. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
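The retrieve-then-prompt idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cloze template, the `[MASK]` token, the toy two-dimensional vectors, and the helper names `cosine`, `retrieve`, and `build_prompt` are all assumptions made for illustration. In practice the vectors would come from a multilingual sentence encoder, and the final prompt would be scored by an MPLM.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
# One HRL pool entry: (sentence, embedding, label or None if unlabeled).
PoolItem = Tuple[str, Vector, Optional[str]]

# Hypothetical cloze template; the actual template is not given in the abstract.
TEMPLATE = "{text} In summary, it was {label}."


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec: Vector, pool: List[PoolItem], k: int) -> List[Tuple[str, Optional[str]]]:
    """Return the k HRL sentences most similar to the LRL query embedding."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [(sentence, label) for sentence, _, label in ranked[:k]]


def build_prompt(lrl_input: str, retrieved: List[Tuple[str, Optional[str]]]) -> str:
    """Prepend retrieved HRL sentences to the LRL input as context.

    Labeled neighbours are rendered with the template; unlabeled ones
    are included as plain text (an assumption of this sketch).
    """
    lines = []
    for sentence, label in retrieved:
        if label is not None:
            lines.append(TEMPLATE.format(text=sentence, label=label))
        else:
            lines.append(sentence)
    # The LRL input goes last, with the label slot masked for the model to fill.
    lines.append(TEMPLATE.format(text=lrl_input, label="[MASK]"))
    return "\n".join(lines)


# Toy HRL pool with made-up embeddings and labels.
pool: List[PoolItem] = [
    ("The movie was wonderful.", [0.9, 0.1], "great"),
    ("The food was awful.", [0.1, 0.9], "terrible"),
    ("I loved this film.", [0.8, 0.3], "great"),
]
query_vec = [1.0, 0.2]  # stand-in for the embedding of an LRL test sentence
neighbours = retrieve(query_vec, pool, k=2)
prompt = build_prompt("<LRL test sentence>", neighbours)
print(prompt)
```

Passing pool entries whose label is `None` gives a prompt made of unlabeled context sentences followed by the masked query, which loosely mirrors the distinction between the unlabeled and labeled settings.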