Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

Hallucination remains a fundamental challenge in vision-language models (VLMs): autoregressive generation, trained by likelihood maximization under joint probabilistic modeling, can produce responses that are linguistically plausible yet physically inconsistent or visually ungrounded. We propose a stage-wise preference optimization framework that reduces hallucination through targeted multimodal data construction. Rather than optimizing directly on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries, emphasizing ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise cases. Hallucinated negatives are generated as minimally perturbed yet visually inconsistent alternatives to grounded responses, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios show improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that a multimodal LLM trained with the proposed DPO framework produces more visually grounded responses than several frontier proprietary VLMs, for example in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from an inherent tendency of autoregressive probabilistic generation to favor linguistically plausible continuations when visual grounding is weak. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives to standard autoregressive decoding.
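To make the preference-pair idea concrete, the following is a minimal sketch of one hallucination-focused pair and the standard per-pair DPO loss. It is an illustration under stated assumptions, not the authors' implementation: the `PreferencePair` schema, the example strings, the log-probability values, and `beta=0.1` are all hypothetical, and a real pipeline would obtain the summed response log-probabilities from the policy and a frozen reference VLM.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One hallucination-focused example (hypothetical schema, not from the paper)."""
    image_id: str
    prompt: str
    chosen: str    # visually grounded response
    rejected: str  # minimally perturbed, visually inconsistent response


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed response log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the frozen reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative false-premise pair: the prompt presupposes an object
# that is assumed to be absent from the image.
pair = PreferencePair(
    image_id="example_0001",
    prompt="What color is the umbrella the man is holding?",
    chosen="The man is not holding an umbrella; he is carrying a backpack.",
    rejected="The umbrella the man is holding is red.",
)

# Made-up log-probabilities: the policy already favors the grounded answer
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-20.0,
                ref_chosen_logp=-15.0, ref_rejected_logp=-15.0)
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss shrinks as the policy shifts probability toward the grounded response and away from the perturbed negative, which is the separation the abstract describes.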
