Recently, very large language models (LLMs) have shown exceptional performance on several English NLP tasks with just in-context learning (ICL), but their utility in other languages is still underexplored. We investigate their effectiveness for NLP tasks in low-resource languages (LRLs), especially in the setting of zero-labelled cross-lingual transfer (0-CLT), where no labelled training data for the target language is available; however, training data from one or more related medium-resource languages (MRLs) is utilized, alongside the available unlabelled test data for the target language. We introduce Self-Supervised Prompting (SSP), a novel ICL approach tailored for the 0-CLT setting. SSP is based on the key observation that LLMs output more accurate labels if in-context exemplars are from the target language (even if their labels are slightly noisy). To operationalize this, since target language training data is not available in 0-CLT, SSP operates in two stages. In Stage I, using source MRL training data, the target language's test data is noisily labelled. In Stage II, these noisy test data points are used as exemplars in ICL for further improved labelling. Additionally, our implementation of SSP uses a novel Integer Linear Programming (ILP)-based exemplar selection that balances similarity, prediction confidence (when available), and label coverage. Experiments on three tasks and eleven LRLs (from three regions) demonstrate that SSP strongly outperforms existing SOTA fine-tuned and prompting-based baselines in the 0-CLT setting.
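To make the ILP-based exemplar selection concrete, the following is a minimal sketch (not the paper's actual formulation) of how such a selection could be posed: binary variables choose k exemplars maximizing a weighted sum of similarity and prediction confidence, subject to covering every label observed among the candidates. The weights `alpha`/`beta`, the coverage constraint, and the solver (SciPy's MILP interface) are all illustrative assumptions.

```python
# Hypothetical sketch of ILP-based exemplar selection: pick k in-context
# exemplars balancing similarity to the test instance, prediction
# confidence, and label coverage. The exact objective and constraints in
# the paper may differ; this is a toy formulation solved with SciPy.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def select_exemplars(similarity, confidence, labels, k, alpha=0.7, beta=0.3):
    """Return indices of k exemplars maximizing
    alpha*similarity + beta*confidence, covering every candidate label."""
    n = len(labels)
    score = alpha * np.asarray(similarity) + beta * np.asarray(confidence)
    c = -score  # milp minimizes, so negate the utility to maximize it

    constraints = []
    # Exactly k exemplars are selected.
    constraints.append(LinearConstraint(np.ones(n), k, k))
    # Every label present among the candidates is covered at least once.
    for lab in set(labels):
        row = np.array([1.0 if y == lab else 0.0 for y in labels])
        constraints.append(LinearConstraint(row, 1, np.inf))

    res = milp(c, constraints=constraints,
               integrality=np.ones(n),   # all x_i are integer
               bounds=Bounds(0, 1))      # 0/1 selection variables
    return np.flatnonzero(res.x > 0.5)

# Toy example: 5 noisily labelled candidates, pick 3 covering both labels.
sim  = [0.9, 0.8, 0.2, 0.7, 0.1]
conf = [0.6, 0.9, 0.8, 0.3, 0.9]
labs = ["POS", "POS", "NEG", "POS", "NEG"]
print(select_exemplars(sim, conf, labs, k=3))
```

In this toy instance, the coverage constraint forces a "NEG" candidate into the selection even though both "NEG" candidates score below the top three "POS" candidates on similarity and confidence alone.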