Large Language Models (LLMs) have demonstrated remarkable capability in machine translation for high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement learning (RL) post-training method that uses only monolingual text to elevate LLMs' translation capability on a large number of low-resource languages while retaining their performance on high-resource languages. Our key insight stems from observing failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models: RL training that uses these QE models as rewards tends to amplify such holes, yielding worse multilingual LLMs. WALAR mitigates these holes in its RL reward through two techniques, word alignment and language alignment. Using WALAR, we continually train an LLM that supports translation across 101 languages. Experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1,400 translation directions of the Flores-101 dataset.