Omnilingual MT: Machine Translation for 1,600 Languages

Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared with the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate because reliable benchmarks and metrics are lacking. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including the manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are freely available and continue to evolve towards Omnilinguality.
