Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Self-evaluation, in which a model estimates the correctness of its own outputs, can improve deployment reliability; however, existing self-evaluation methods were designed for LLMs, depend heavily on language priors, and are therefore ill-suited to evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to the visual input, together with an unsupervised core-region masking strategy that amplifies the influence of salient image regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
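For concreteness, the sketch below shows one plausible instantiation of the scoring function described above. It assumes a hypothetical `answer_logits` hook that returns the per-token logits an LVLM assigns to a candidate answer; the additive entropy/IS combination with weight `alpha` and the core-masking convention are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token predictive entropy from [T, V] answer-token logits."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

def vauq_score(answer_logits, image, core_masked_image,
               question, answer, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative VAUQ-style score; higher suggests a more reliable answer.

    `answer_logits(image, question, answer)` is a hypothetical interface
    returning the [T, V] logits the LVLM assigns to the answer tokens.
    `core_masked_image` is assumed to have its salient (core) regions
    masked out, so the entropy gap isolates how much those regions
    reduce the model's uncertainty.
    """
    h_full = token_entropy(answer_logits(image, question, answer))              # H(y | v, q)
    h_masked = token_entropy(answer_logits(core_masked_image, question, answer))  # H(y | mask(v), q)
    image_information = h_masked - h_full  # core-masked Image-Information Score (IS)
    # High IS (answer grounded in core visual evidence) and low predictive
    # entropy (confident prediction) together indicate a trustworthy answer.
    return image_information - alpha * h_full
```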