High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to verify due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including the manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all of our 1B- to 8B-parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translation further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, coming close to solving the "understanding" side of the MT puzzle for the 1,600 evaluated languages. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving toward omnilinguality and are freely available.