Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts rather than clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. To address these limitations, we propose a fine-grained, on-policy alignment framework. It constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. It combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.
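As a rough illustration of the two loss terms named in the abstract, the NumPy sketch below computes a token-wise bidirectional KL term and a clean-versus-corrupted contrastive penalty. The abstract does not give the actual formulation, so everything specific here is an assumption made for illustration: the function names, the reading of "bidirectional" as forward plus reverse KL, the optional per-token weights, and the hinge form with a margin.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def tokenwise_bidirectional_kl(policy_logits, ref_logits, token_weights=None):
    """Weighted mean over tokens of KL(p||q) + KL(q||p).

    policy_logits, ref_logits: arrays of shape (T, V) for T tokens, V vocab entries.
    token_weights: optional shape (T,) weights (hypothetical; e.g. larger on
    clinically critical tokens). Defaults to uniform weights.
    """
    logp = log_softmax(policy_logits)
    logq = log_softmax(ref_logits)
    p, q = np.exp(logp), np.exp(logq)
    kl_pq = (p * (logp - logq)).sum(axis=-1)  # forward KL per token
    kl_qp = (q * (logq - logp)).sum(axis=-1)  # reverse KL per token
    per_token = kl_pq + kl_qp
    if token_weights is None:
        token_weights = np.ones_like(per_token)
    return float((token_weights * per_token).sum() / token_weights.sum())


def visual_contrastive_penalty(logp_clean, logp_corrupted, margin=1.0):
    """Hinge penalty (assumed form): zero only when the response is at least
    `margin` nats more likely given the clean image than the corrupted one."""
    return float(np.maximum(0.0, margin - (logp_clean - logp_corrupted)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policy = rng.normal(size=(4, 6))     # 4 tokens, vocabulary of 6
    reference = rng.normal(size=(4, 6))
    weights = np.array([1.0, 1.0, 3.0, 1.0])  # emphasize the third token
    print(tokenwise_bidirectional_kl(policy, reference, weights))
    print(visual_contrastive_penalty(logp_clean=-2.0, logp_corrupted=-2.5))
```

Under these assumptions, the per-token weights are where a fine-grained scheme could emphasize clinically critical tokens over filler text, and the penalty is nonzero whenever corrupting the lesion region barely changes the response likelihood, which is one way to read "generated without adequate visual evidence".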