Minimum Bayes Risk (MBR) decoding can significantly improve the translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. In this paper, we show how a recently developed Reinforcement Learning (RL) technique, Direct Preference Optimization (DPO), can be used to fine-tune MLLMs so that they obtain the gains of MBR without the additional computation at inference time. Our fine-tuned models perform significantly better on multiple Neural Machine Translation (NMT) test sets than base MLLMs without preference optimization. Our method boosts the translation performance of MLLMs using relatively small monolingual fine-tuning sets.
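To make the idea concrete, the following is a minimal sketch of how MBR scores over sampled translations could be turned into preference pairs for DPO. It is an illustration under stated assumptions, not the procedure used in the paper: the utility function (`overlap_f1`, a toy unigram F1), the rule of pairing the highest-scoring candidate against the lowest-scoring one, and the value `beta=0.1` are all hypothetical choices. The `dpo_loss` function is the standard per-pair DPO objective, written over summed sequence log-probabilities.

```python
import math
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): unigram F1 on whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_scores(candidates, utility=overlap_f1):
    """Expected utility of each candidate against the other candidates,
    which serve as pseudo-references (uniform weights)."""
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two candidates")
    scores = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(utility(hyp, ref) for ref in others) / len(others))
    return scores


def preference_pair(candidates, utility=overlap_f1):
    """Hypothetical pairing rule: (chosen, rejected) = (MBR-best, MBR-worst)."""
    scores = mbr_scores(candidates, utility)
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], candidates[worst]


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss, -log(sigmoid(beta * margin)), where each argument
    is a summed sequence log-probability under the policy or the frozen
    reference model."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


# Usage: candidates would normally be sampled from the base MLLM for one
# source sentence; these strings are made up for illustration.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
chosen, rejected = preference_pair(samples)
```

In this sketch the quadratic cost of MBR (every candidate scored against every other) is paid once when building the training pairs, which is the sense in which fine-tuning can move that computation out of inference.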