Large-scale multilingual machine translation systems have demonstrated a remarkable ability to translate directly between numerous languages, making them increasingly appealing for real-world applications. However, when deployed in the wild, these models may generate hallucinated translations that can severely undermine user trust and raise safety concerns. Existing research on hallucinations has focused primarily on small bilingual models trained on high-resource languages. This leaves a gap in our understanding of hallucinations in massively multilingual models across diverse translation scenarios. In this work, we fill this gap with a comprehensive analysis of both the M2M family of conventional neural machine translation models and ChatGPT, a general-purpose large language model (LLM) that can be prompted for translation. Our investigation covers a broad spectrum of conditions: over 100 translation directions, a range of resource levels, and language pairs that go beyond English-centric ones. We provide key insights into the prevalence, properties, and mitigation of hallucinations, paving the way toward more responsible and reliable machine translation systems.