Despite rapid development and widespread application, Large Vision-Language Models (LVLMs) remain prone to generating hallucinations. An over-reliance on linguistic priors has been identified as a key factor behind these hallucinations. In this paper, we propose to alleviate this problem with a novel image-biased decoding (IBD) technique. Our method derives the next-token probability distribution by contrasting the predictions of a conventional LVLM with those of an image-biased LVLM, thereby amplifying correct information highly correlated with the image content while suppressing hallucinatory errors caused by excessive dependence on text. We further conduct a comprehensive statistical analysis to validate the reliability of our method, and design an adaptive adjustment strategy for robust and flexible handling under varying conditions. Experimental results across multiple evaluation metrics verify that our method, while requiring no additional training data and only a minimal increase in model parameters, significantly reduces hallucinations in LVLMs and enhances the truthfulness of the generated responses.
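The contrastive step described above can be illustrated with a minimal sketch. The exact formulation and the `alpha` weighting below are assumptions for illustration, not the paper's precise equation: the image-biased model's logits are amplified relative to the conventional model's, and the difference is renormalized into a next-token distribution.

```python
import numpy as np

def image_biased_decode(logits_base, logits_ib, alpha=1.0):
    """Hypothetical contrastive combination of two LVLM heads.

    logits_base: next-token logits from the conventional LVLM.
    logits_ib:   next-token logits from the image-biased LVLM.
    alpha:       contrast strength (an assumed hyperparameter);
                 alpha=0 recovers the image-biased model alone.
    """
    # Boost tokens the image-biased model prefers; penalize tokens
    # favored only by the text-prior-driven base model.
    contrast = (1.0 + alpha) * logits_ib - alpha * logits_base
    # Numerically stable softmax over the contrasted logits.
    exp = np.exp(contrast - contrast.max())
    return exp / exp.sum()

# Toy example: the base model prefers token 0 (a text-prior guess),
# while the image-biased model prefers token 1 (image-grounded).
base = np.array([2.0, 1.0, 0.0])
ib = np.array([0.0, 2.0, 1.0])
probs = image_biased_decode(base, ib, alpha=1.0)
```

In this toy case the contrasted distribution shifts mass toward token 1, the token supported by the image-biased predictions, which is the qualitative behavior IBD aims for.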