Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision-language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
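To make the reweighting idea concrete, the following is a minimal sketch (not the authors' implementation) of a difficulty-weighted DPO objective. The abstract only states that easy pairs are down-weighted and hard pairs emphasized using difficulty scores obtained without extra training; the specific weighting function, the `gamma` sharpness parameter, and the normalization below are illustrative assumptions.

```python
# A hedged sketch of difficulty-aware DPO reweighting; the exact weighting
# scheme used by DA-DPO is not specified in the abstract and is assumed here.
import torch
import torch.nn.functional as F


def difficulty_weighted_dpo_loss(policy_chosen_logps: torch.Tensor,
                                 policy_rejected_logps: torch.Tensor,
                                 ref_chosen_logps: torch.Tensor,
                                 ref_rejected_logps: torch.Tensor,
                                 difficulty: torch.Tensor,
                                 beta: float = 0.1,
                                 gamma: float = 1.0) -> torch.Tensor:
    """Per-pair DPO loss rescaled by an estimated difficulty score.

    `difficulty` is assumed to lie in [0, 1], with larger values for harder
    (less easily distinguishable) preference pairs.
    """
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model (standard DPO).
    logits = (policy_chosen_logps - policy_rejected_logps) \
           - (ref_chosen_logps - ref_rejected_logps)
    per_pair_loss = -F.logsigmoid(beta * logits)

    # Hypothetical reweighting: easy pairs (low difficulty) contribute less,
    # hard pairs (high difficulty) contribute more; gamma controls sharpness.
    weights = difficulty.clamp(0.0, 1.0) ** gamma
    weights = weights / (weights.mean() + 1e-8)  # keep the overall loss scale stable
    return (weights * per_pair_loss).mean()
```

In this reading, the difficulty scores act purely as per-sample loss weights, so the method adds no new data, no extra fine-tuning stage, and negligible compute on top of standard multimodal DPO training.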