Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general-purpose foundation models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task–data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
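The abstract describes DQO as a DPO variant in which a quality estimation (QE) model stands in for human preference judgments. The sketch below is an assumption of how such a step could look, not the authors' reference implementation: candidate translations are ranked by QE score to form chosen/rejected pairs, which are then fed to the standard DPO objective. The helper names (`dpo_loss`, `build_preference_pair`) and the hard-coded QE scores are hypothetical; in practice the scores would come from a pre-trained, reference-free QE model.

```python
# Minimal DQO-style sketch (assumed, not the paper's implementation):
# QE scores act as the preference signal for a standard DPO loss.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over sequence-level log-probabilities."""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()


def build_preference_pair(candidates: list[str], qe_scores: list[float]):
    """Pick chosen/rejected translations by QE score (proxy for human preference)."""
    ranked = sorted(zip(qe_scores, candidates), key=lambda p: p[0])
    rejected, chosen = ranked[0][1], ranked[-1][1]
    return chosen, rejected


if __name__ == "__main__":
    # QE scores hard-coded only to keep the sketch runnable; a real pipeline
    # would score sampled NMT outputs with a pre-trained QE model.
    candidates = ["translation A", "translation B", "translation C"]
    qe_scores = [0.71, 0.58, 0.83]
    chosen, rejected = build_preference_pair(candidates, qe_scores)
    print("chosen:", chosen, "| rejected:", rejected)

    # Dummy sequence log-probs standing in for the NMT policy and the frozen
    # reference model; in training these come from the two models' forward passes.
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                    torch.tensor([-12.8]), torch.tensor([-15.1]))
    print("preference loss:", loss.item())
```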