Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, such as ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data, even though these references are human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.