Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations through two key contributions. First, a self-augmentation prompting strategy leverages the model's intrinsic knowledge to dynamically align the semantics of the query with the visual augmentation. Second, an adaptive thresholding algorithm adjusts the next-token candidate set size according to the sparsity of the output distribution, utilizing the full information in the logits. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly improves factual consistency over state-of-the-art decoding methods. This work highlights the importance of combining query-dependent augmentation with entropy-aware decoding for more reliable LVLM generation.
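To make the second contribution concrete, the following is a minimal sketch of entropy-aware adaptive thresholding combined with a VCD-style contrastive step, assuming the candidate set is pruned by a plausibility cutoff that relaxes as the output distribution flattens. The function name, the schedule `beta = base_beta * exp(-alpha * H)`, and the parameters `base_beta`, `alpha`, and `gamma` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def entropy_adaptive_candidates(logits: torch.Tensor,
                                base_beta: float = 0.1,
                                alpha: float = 1.0) -> torch.Tensor:
    """Select next-token candidates via an entropy-scaled plausibility cutoff.

    When the distribution is sharp (low entropy), the threshold stays tight and
    few candidates survive; when it is flat (high entropy), the threshold
    relaxes so more of the distribution's mass is retained.
    """
    probs = F.softmax(logits, dim=-1)
    # Normalized Shannon entropy in [0, 1]: 0 = one-hot, 1 = uniform.
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    norm_entropy = entropy / torch.log(torch.tensor(float(logits.shape[-1])))
    # Hypothetical schedule: lower the cutoff as the output gets flatter.
    beta = base_beta * torch.exp(-alpha * norm_entropy)
    threshold = beta * probs.max(dim=-1, keepdim=True).values
    return probs >= threshold  # boolean mask over the vocabulary


if __name__ == "__main__":
    torch.manual_seed(0)
    vocab = 32000
    logits_orig = torch.randn(vocab)  # stand-in for logits on the original image
    logits_aug = torch.randn(vocab)   # stand-in for logits on the augmented view

    # VCD-style contrast: amplify what the original view supports over the
    # augmented view, then restrict to the entropy-adaptive candidate set.
    gamma = 1.0  # contrastive strength (hypothetical value)
    contrastive = (1 + gamma) * logits_orig - gamma * logits_aug
    contrastive[~entropy_adaptive_candidates(logits_orig)] = float("-inf")
    next_token = contrastive.argmax(dim=-1)
    print(int(next_token))
```

The design intuition is that a fixed top-k or fixed plausibility cutoff ignores how confident the model actually is at each step; tying the cutoff to the normalized entropy lets the candidate set shrink on confident tokens and widen on uncertain ones.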