Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating a massive number of languages? 2) Which factors affect LLMs' performance in translation? We evaluate popular LLMs, including XGLM, OPT, BLOOMZ, and ChatGPT, on 102 languages. Our empirical results show that even the best model, ChatGPT, still lags behind the supervised baseline NLLB in 83.33% of translation directions. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, when in-context exemplars are given, prompt semantics can, surprisingly, be ignored: LLMs still perform strongly even with unreasonable prompts. Second, cross-lingual exemplars can provide better task instruction for low-resource translation than exemplars in the same language pair. Third, we observe that BLOOMZ's performance on the Flores-101 dataset is overestimated, indicating a potential risk in using public datasets for evaluation.