GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence, on the assumption that a caption, once available, should always be consumed. We show that this assumption fails: naively appending a caption can lower accuracy rather than raise it, costing Qwen2.5-VL-3B$^\dagger$ nearly ten points on HallusionBench. To understand why, we build \textbf{GD-Probe}, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a \emph{per-query} property: the same caption helps global questions and harms detail ones. Both effects arise from a single mechanism, in which an embedded caption competes with the image for attention and pulls the model's evidence onto its own text; the sign of the effect is set by whether the caption \emph{covers} the queried content. Crucially, which of these two regimes a query falls into can be read from quantities the decoder already emits, without attention access or grounding. We turn this into \textbf{GEASS} (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust: it gates the caption by the clean path's confidence, weights it by the entropy reduction it induces, and raises the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.
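
To make the three ingredients named above concrete (confidence gating, entropy-reduction weighting, and a higher evidence bar under disagreement), the following is a minimal illustrative sketch and not the actual GEASS formulation, which the abstract does not specify. The function names (`caption_trust`, `fuse`), the thresholds (`conf_gate`, `disagree_bar`), the normalisation by maximum entropy, and the linear interpolation of logits are all assumptions made purely for illustration. The inputs are the answer-token logits from a clean pass (image and question only) and from a caption-conditioned pass.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def caption_trust(clean_logits, cap_logits, conf_gate=0.9, disagree_bar=0.2):
    """Hypothetical per-query trust weight in [0, 1] for the caption path.

    All thresholds and the exact functional form are illustrative
    assumptions, not values or formulas taken from the paper.
    Requires at least two candidate answers.
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(cap_logits)

    # Gate: if the clean path is already confident, ignore the caption.
    if max(p_clean) >= conf_gate:
        return 0.0

    # Weight: entropy reduction induced by the caption, normalised by the
    # maximum possible entropy log(K) so that it lies in [-1, 1].
    gain = (entropy(p_clean) - entropy(p_cap)) / math.log(len(p_clean))

    # Evidence bar: demand a larger gain when the two paths disagree.
    agree = p_clean.index(max(p_clean)) == p_cap.index(max(p_cap))
    bar = 0.0 if agree else disagree_bar

    return min(1.0, max(0.0, gain - bar))


def fuse(clean_logits, cap_logits, **kwargs):
    """Interpolate from the clean logits toward the caption logits by the trust weight."""
    w = caption_trust(clean_logits, cap_logits, **kwargs)
    return [c + w * (k - c) for c, k in zip(clean_logits, cap_logits)]
```

Under these assumptions, a query on which the clean path is already confident is returned unchanged, while an uncertain query moves toward the caption-conditioned logits only to the extent that the caption sharpens the distribution, and less so when the caption flips the predicted answer.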
