Despite the cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains a key mechanism for training task-specific multilingual models. However, for many low-resource languages, building a reliable translation service requires large amounts of costly human-annotated translation pairs. Further, translation services can remain brittle because of domain mismatch between the task-specific input text and the general-purpose text used to train translation models. For multilingual semantic parsing, we demonstrate the effectiveness and flexibility of large language models (LLMs) for translating English datasets into several languages via few-shot prompting. Through extensive comparisons on two public datasets, MTOP and MASSIVE, spanning 50 languages and several domains, we show that our method of translating data using LLMs outperforms a strong translate-train baseline on 41 out of 50 languages. We study the key design choices that enable more effective multilingual data translation via prompted LLMs.