Contemporary neural machine translation (NMT) systems are almost exclusively trained on supervised parallel data. Despite tremendous progress, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. Our experiments focus on English-to-German translation as a representative high-resource language pair. We implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model raises its COMET score from 0.703 to 0.747 on the English-to-German task, a significant improvement in translation quality. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
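For readers unfamiliar with DPO, the per-example objective can be sketched as follows. This is a minimal illustration of the standard DPO loss, not the implementation used in the paper. The function names, the value `beta=0.1`, and the assumption that the expert translator's feedback yields a preferred ("chosen") and a dispreferred ("rejected") translation for each source sentence are all hypothetical. The inputs are summed token log-probabilities of each translation under the model being trained (the policy) and under a frozen copy of the original model (the reference).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not state one
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a whole translation
    given the source sentence.
    """
    # How much more the policy favors each translation than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy shifts probability toward the chosen translation.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference, both margins are zero and the loss is log 2. It drops below that value as the policy raises the relative likelihood of the preferred translation, and rises above it if the policy moves toward the rejected one.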