Hallucinations, i.e., generated responses that are inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Using the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers and uncover a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates at earlier layers than hallucinatory ones. Building on this finding, we introduce Context Embedding Injection (CEI), a lightweight method that uses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum generation length of 512 tokens), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By combining novel mechanistic insights with a scalable intervention, this work advances hallucination mitigation in LVLMs.
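To make the probing procedure concrete, the sketch below shows a standard Logit Lens pass in the HuggingFace Transformers style: each intermediate hidden state of the last position is decoded through the model's final norm and output head, and the probability assigned to the model's final candidate token is recorded per layer, so that an early or late "commitment" depth becomes visible. This is a minimal sketch, not the paper's exact protocol; the checkpoint name is a placeholder, the `get_decoder().norm` and `lm_head` attributes assume a LLaMA-style backbone, and for an LVLM such as LLaVA the image features would enter through the model's own processor rather than a plain text prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal Logit Lens probe: decode every intermediate hidden state with the
# output head and track how early probability mass commits to the model's
# final candidate. Attribute names assume a LLaMA-style HuggingFace model.
model_name = "path/to/lvlm-checkpoint"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Describe the image.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_token = out.logits[0, -1].argmax().item()  # the model's final choice
norm = model.get_decoder().norm  # final RMSNorm, reused as the "lens"
num_states = len(out.hidden_states)

for layer_idx, hidden in enumerate(out.hidden_states):
    h = hidden[0, -1]
    if layer_idx < num_states - 1:
        h = norm(h)  # intermediate states are pre-norm in HF LLaMA models
    probs = torch.softmax(model.lm_head(h).float(), dim=-1)
    print(f"layer {layer_idx:2d}: p(final candidate) = {probs[final_token].item():.4f}")
```

Under the commitment-depth observation above, truthful tokens would show this per-layer probability rising early in the stack, while hallucinatory tokens would commit only in the last few layers.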
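The abstract does not pin down CEI's exact injection rule, so the following is only a plausible minimal sketch under stated assumptions: the final hidden state of the last prompt token is cached as the context embedding and linearly blended, with a hypothetical fixed weight `alpha`, into each decoding step's last hidden state before the output head. The dynamic variant mentioned above would presumably adapt this weight per step rather than fix it. As in the previous sketch, a LLaMA-style HuggingFace model is assumed, where `hidden_states[-1]` is already post final norm and can feed `lm_head` directly.

```python
import torch

@torch.no_grad()
def generate_with_cei(model, tokenizer, prompt, alpha=0.1, max_new_tokens=64):
    """Greedy decoding with a CEI-style grounding injection (illustrative)."""
    device = next(model.parameters()).device
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Cache the context embedding: the last prompt token's final hidden state
    # (post final norm in HF LLaMA-style models, so lm_head applies directly).
    out = model(ids, output_hidden_states=True, use_cache=True)
    context_emb = out.hidden_states[-1][:, -1, :]
    hidden, past = context_emb, out.past_key_values

    generated = []
    for _ in range(max_new_tokens):
        # Blend the grounding signal into the current step's hidden state
        # before the output head; alpha is an assumed fixed mixing weight.
        mixed = (1.0 - alpha) * hidden + alpha * context_emb
        next_id = model.lm_head(mixed).argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        out = model(next_id.unsqueeze(0), past_key_values=past,
                    output_hidden_states=True, use_cache=True)
        hidden = out.hidden_states[-1][:, -1, :]
        past = out.past_key_values
    return tokenizer.decode(generated)
```

In this reading, `alpha` trades off fluency against grounding: a larger weight keeps generation anchored to the visual context at the cost of flexibility, and a dynamic schedule could, for example, raise it when the per-step distribution is uncertain.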