Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, such as ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs on the MT task, emphasizing the quality issues present in the reference data despite its being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating translations that are adequate but not perfect. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, matches or exceeds the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
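To make the contrastive-preference idea concrete, the sketch below illustrates a CPO-style objective: a DPO-like preference term that rewards the model for scoring a preferred translation above a dispreferred one, combined with a negative log-likelihood term on the preferred translation. The function signature, the `beta` temperature, and the scalar log-probability inputs are illustrative assumptions, not the paper's exact implementation.

```python
import math

def cpo_loss(logp_preferred: float, logp_dispreferred: float, beta: float = 0.1) -> float:
    """Sketch of a CPO-style loss (illustrative, not the paper's code).

    logp_preferred:    model log-probability of the higher-quality translation
    logp_dispreferred: model log-probability of the adequate-but-imperfect one
    beta:              temperature scaling the preference margin (assumed value)
    """
    # Preference term: -log sigmoid of the scaled log-probability margin,
    # pushing the preferred translation's likelihood above the dispreferred one's.
    margin = beta * (logp_preferred - logp_dispreferred)
    pref_term = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # Likelihood term: standard negative log-likelihood on the preferred translation.
    nll_term = -logp_preferred
    return pref_term + nll_term
```

The loss is lowest when the preferred translation is both likely in absolute terms and clearly separated from the dispreferred one, which is how such an objective discourages merely adequate outputs rather than imitating references alone.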