Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty from answer likelihoods or prompting the model to report its confidence, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation bias. To alleviate these issues, we propose \textbf{De}compose and \textbf{C}ompare \textbf{C}onsistency (\texttt{DeCC}) for reliability measurement. \texttt{DeCC} measures the reliability of a VLM's direct answer by comparing its consistency with indirect answers: the direct answer is generated using the VLM's internal reasoning process, while the indirect answers are obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM. Experiments across six vision-language tasks with three VLMs show that \texttt{DeCC}'s reliability estimation correlates better with task accuracy than existing methods.