Large language models (LLMs) are highly proficient text generators. We leverage this capability to generate task-specific data via zero-shot prompting and to promote cross-lingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using the teacher to label LLM generations, and we employ a set of simple data selection strategies based on the teacher's label probabilities. Compared to using all LLM generations without any subset selection, our strategies identify a representative and diverse subset of generations that boosts zero-shot accuracy while remaining efficient. We also highlight other important design choices that affect cross-lingual performance, such as the use of translations of source data and which labels are best assigned to the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (up to 7.13 absolute points, and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains.
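To make the selection step concrete, below is a minimal sketch of one plausible strategy, assuming a teacher that outputs softmax label probabilities over the generations; the class-balanced, highest-confidence heuristic and the helper name `select_subset` are illustrative assumptions, not necessarily the exact strategies proposed in this work.

```python
import numpy as np

def select_subset(texts, label_probs, k):
    """Hypothetical sketch: pick a class-balanced subset of LLM
    generations, keeping those the teacher labels most confidently.

    texts       : list of N generated sentences
    label_probs : (N, C) array of teacher softmax probabilities
    k           : total number of examples to keep
    """
    labels = label_probs.argmax(axis=1)       # teacher's pseudo-labels
    confidence = label_probs.max(axis=1)      # teacher's max probability
    num_labels = label_probs.shape[1]
    per_class = k // num_labels
    selected = []
    for c in range(num_labels):
        idx = np.where(labels == c)[0]
        # sort this class's generations by descending teacher confidence
        top = idx[np.argsort(-confidence[idx])][:per_class]
        selected.extend(top.tolist())
    # return (text, pseudo-label) pairs for training on the target language
    return [(texts[i], int(labels[i])) for i in selected]
```

Selecting by teacher label probabilities rather than re-querying the LLM keeps this step efficient, since it requires only a forward pass of the teacher over each generation.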