Minimum Bayes Risk (MBR) decoding can significantly improve the translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. We show how the recently developed Reinforcement Learning technique Direct Preference Optimization (DPO) can be used to fine-tune MLLMs so that they obtain the gains of MBR without any additional computation at inference time. Our method uses only a small monolingual fine-tuning set and yields significantly improved performance on multiple NMT test sets compared to MLLMs without DPO.
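To make the two ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy unigram-F1 utility in place of a learned translation metric, assumes that preference pairs are formed from the highest- and lowest-ranked hypotheses under MBR, and uses the standard DPO objective with an assumed `beta` of 0.1. All function names (`overlap_f1`, `mbr_scores`, `mbr_decode`, `preference_pair`, `dpo_loss`) are hypothetical.

```python
import math
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): unigram F1 on whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_scores(hypotheses, utility=overlap_f1):
    """Expected utility of each hypothesis, with the others as pseudo-references."""
    if len(hypotheses) < 2:
        raise ValueError("MBR needs at least two hypotheses")
    scores = []
    for i, hyp in enumerate(hypotheses):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        scores.append(sum(utility(hyp, o) for o in others) / len(others))
    return scores


def mbr_decode(hypotheses, utility=overlap_f1):
    """Return the hypothesis with the highest expected utility."""
    scores = mbr_scores(hypotheses, utility)
    best = max(range(len(hypotheses)), key=scores.__getitem__)
    return hypotheses[best]


def preference_pair(source, hypotheses, utility=overlap_f1):
    """Build one DPO training example: MBR-best is chosen, MBR-worst is rejected.

    Pairing best against worst is an assumption; other pairings are possible.
    """
    scores = mbr_scores(hypotheses, utility)
    order = sorted(range(len(hypotheses)), key=scores.__getitem__)
    return {
        "prompt": source,
        "chosen": hypotheses[order[-1]],
        "rejected": hypotheses[order[0]],
    }


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy being trained and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    candidates = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog ran away",
    ]
    print(mbr_decode(candidates))
    print(preference_pair("Die Katze saß auf der Matte", candidates))
```

The quadratic number of utility calls in `mbr_scores` is what makes MBR expensive at inference time. In this sketch that cost is paid once, offline, to build preference pairs from monolingual source sentences, and the fine-tuned model is then decoded without the MBR step.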