Large Vision-Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high-stakes medical contexts remains underexplored, and the field lacks the rigorous empirical groundwork needed to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants in the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA-Med and HuatuoGPT-Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine-tuning, with efficacy varying significantly across tasks and backbones, and they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we propose a targeted preference construction strategy as a proof of concept that explicitly addresses the visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and code, at https://github.com/dmis-lab/med-vlm-dpo.
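For context, the nine evaluated formulations are variants of the standard DPO objective (Rafailov et al., 2023), sketched below as general background rather than as this paper's specific method; here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $(x, y_w, y_l)$ a prompt paired with preferred and dispreferred responses, and $\beta$ a temperature hyperparameter:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]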