Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when it is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee comes at a steep price: to keep the hallucination rate below $5\%$ on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than $80\%$ of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, plugging acquisition naively into an already calibrated filter breaks the statistical guarantee (realized risk overshoots the target by up to $17$ points), because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores \emph{restores} the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open LVLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.
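The second observation can be illustrated with a minimal sketch, which is not BCEA itself. It treats the whole look-again policy as part of the score function and calibrates on the resulting post-acquisition scores. The calibration step is swapped for a simpler, standard one: a split-conformal quantile over calibration claims known to be unsupported. This bounds the probability that an unsupported claim is answered, which is a weaker statement than a bound on the hallucination rate among asserted claims. All names (`score_fn`, `acquire_fn`, `answer_at`, `budget`) are hypothetical placeholders.

```python
import math


def post_acquisition_score(claim, score_fn, acquire_fn, budget, answer_at):
    """Run the full acquisition policy and return the final grounding score.

    score_fn, acquire_fn, budget and answer_at are hypothetical and must be
    fixed BEFORE calibration. The same policy is then applied to calibration
    and test claims alike, so their final scores remain exchangeable.
    """
    score = score_fn(claim)
    steps = 0
    # Look again (e.g. zoom or crop) while unsure and budget remains.
    while score < answer_at and steps < budget:
        claim = acquire_fn(claim)
        score = score_fn(claim)
        steps += 1
    return score


def calibrate_threshold(unsupported_scores, alpha):
    """Split-conformal threshold from post-acquisition scores of
    calibration claims that are known to be unsupported.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    Under exchangeability, a new unsupported claim scores strictly above
    this threshold with probability at most alpha.
    """
    n = len(unsupported_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never answer.
        return math.inf
    return sorted(unsupported_scores)[k - 1]


def decide(final_score, tau):
    """Answer only if the post-acquisition score clears the threshold."""
    return "answer" if final_score > tau else "abstain"
```

The naive variant criticized in the first observation would compute the threshold from scores taken before acquisition and then compare it against scores taken after acquisition. In this sketch, both calibration and test scores come from `post_acquisition_score`, so that mismatch does not arise.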
