Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies have explored aspects of LLMs' MT capabilities. However, there remains a wide variety of languages for which recent LLM MT performance has never been evaluated. Without published experimental evidence, it is difficult for speakers of the world's diverse languages to know whether and how they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with an MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of the languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages.