Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark for measuring the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO applies preference optimization to a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/.
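The abstract does not spell out the objective used by HalluVL-DPO, so the following is only a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair, assuming the "grounded" response plays the role of the chosen sample and the "hallucinated" response the rejected one. The function name, argument names, and the value of `beta` are hypothetical and are not taken from the paper.

```python
import math


def dpo_pair_loss(policy_logp_grounded, policy_logp_hallucinated,
                  ref_logp_grounded, ref_logp_hallucinated, beta=0.1):
    """Standard DPO loss for one (grounded, hallucinated) response pair.

    Inputs are sequence log-probabilities of each response under the policy
    being fine-tuned and under a frozen reference model. All names and the
    default beta are illustrative assumptions, not the paper's settings.
    """
    # Log-ratio of policy to reference for each response.
    grounded_logratio = policy_logp_grounded - ref_logp_grounded
    hallucinated_logratio = policy_logp_hallucinated - ref_logp_hallucinated

    # Scaled preference margin: positive when the policy favors the grounded
    # response more strongly than the reference model does.
    margin = beta * (grounded_logratio - hallucinated_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability mass towards the grounded response relative to the reference.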