Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, such as ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data, despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but imperfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
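As a rough illustration of the idea behind CPO, the sketch below computes a per-example loss that combines a contrastive preference term (pushing the model to score a preferred translation above a dispreferred one) with a negative log-likelihood term on the preferred translation. This is a minimal, dependency-free sketch operating on scalar sequence log-probabilities; the function name, argument names, and the `beta` default are illustrative, not taken from the paper's released code.

```python
import math


def cpo_loss(logp_preferred: float, logp_dispreferred: float, beta: float = 0.1) -> float:
    """Minimal sketch of a CPO-style objective for one training pair.

    logp_preferred / logp_dispreferred: model log-probabilities of the
    preferred and dispreferred translations given the source sentence.

    The loss has two parts:
      1. a preference term, -log sigmoid(beta * (logp_preferred - logp_dispreferred)),
         which widens the margin between the two translations;
      2. an NLL term, -logp_preferred, which keeps the model anchored
         to the preferred translation.
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    prefer_term = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll_term = -logp_preferred
    return prefer_term + nll_term
```

A wider log-probability margin between the preferred and dispreferred translations yields a smaller preference term, so the loss rewards the model for ranking better translations above merely adequate ones rather than for reproducing a single reference.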