Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it difficult to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding a decrease in their likelihood, an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
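To make the described objective concrete, below is a minimal PyTorch-style sketch of how the three components named in the abstract (the standard language preference term, an image preference term, and a positive-reward anchor) might be combined. The function name, argument names, the equal weighting of the terms, and the use of a corrupted image as the dispreferred visual condition are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mdpo_loss(logp_chosen, logp_rejected,          # log pi_theta(y_w | m, x), log pi_theta(y_l | m, x)
              ref_logp_chosen, ref_logp_rejected,  # same quantities under the frozen reference model
              logp_chosen_bad_img,                 # log pi_theta(y_w | m', x), m' = corrupted image (assumption)
              ref_logp_chosen_bad_img,
              beta=0.1):
    """Hedged sketch of a multimodal DPO objective with an image-preference
    term and a positive-reward anchor, following the abstract's description.
    All inputs are summed token log-probs of full responses, shape [batch]."""
    # Implicit DPO rewards: beta * log-ratio between policy and reference.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    r_chosen_bad_img = beta * (logp_chosen_bad_img - ref_logp_chosen_bad_img)

    # Standard (language) preference: the chosen response beats the rejected one.
    loss_text = -F.logsigmoid(r_chosen - r_rejected)

    # Image preference: the chosen response should score higher under the true
    # image than under a corrupted one, forcing the model to use the image condition.
    loss_image = -F.logsigmoid(r_chosen - r_chosen_bad_img)

    # Reward anchor: push the chosen response's reward to be positive, so its
    # absolute likelihood does not drift down during relative optimization.
    loss_anchor = -F.logsigmoid(r_chosen)

    # Equal weighting is an assumption for this sketch.
    return (loss_text + loss_image + loss_anchor).mean()
```

Note the design intent suggested by the abstract: the image-preference term compares the same chosen response under two visual conditions, so the only way to lower this loss is to actually attend to the image, while the anchor term counteracts the tendency of purely relative objectives to reduce the chosen response's likelihood.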