Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify the attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation, which marginalizes over potential semantic queries to aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, offering a robust and highly adaptable solution without compromising fundamental generation capabilities.
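
The two components named in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration only: the abstract does not specify how heads are selected or how queries are marginalized, so the threshold rule, the scaling factor `alpha`, the top-`k` cutoff, and the function names `amplify_visual_heads` and `expected_query` are hypothetical stand-ins, not the ACT procedure itself.

```python
from typing import List, Sequence


def amplify_visual_heads(
    attn: Sequence[Sequence[float]],
    visual_idx: Sequence[int],
    threshold: float = 0.3,  # hypothetical selection rule, not from the paper
    alpha: float = 1.5,      # hypothetical amplification factor
) -> List[List[float]]:
    """Boost attention to visual tokens in heads that already attend to them.

    attn holds one attention distribution over key positions per head.
    A head is treated as "visual" when its attention mass on visual_idx
    reaches the threshold; its visual weights are scaled by alpha and the
    distribution is renormalized. Other heads are returned unchanged.
    """
    visual = set(visual_idx)
    out = []
    for head in attn:
        mass = sum(w for i, w in enumerate(head) if i in visual)
        if mass >= threshold:
            scaled = [w * alpha if i in visual else w for i, w in enumerate(head)]
            total = sum(scaled)
            out.append([w / total for w in scaled])
        else:
            out.append(list(head))
    return out


def expected_query(
    probs: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int = 2,  # hypothetical number of candidate tokens to keep
) -> List[float]:
    """Probability-weighted average of the top-k candidate token embeddings.

    This replaces a single discrete token with a soft mixture, one simple
    way to marginalize over candidate "semantic queries".
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    dim = len(embeddings[0])
    return [sum(probs[i] / z * embeddings[i][d] for i in top) for d in range(dim)]


if __name__ == "__main__":
    # Two heads over three key positions; positions 1 and 2 are visual tokens.
    heads = [[0.1, 0.4, 0.5], [0.8, 0.1, 0.1]]
    print(amplify_visual_heads(heads, visual_idx=[1, 2]))
    # Three candidate next tokens with 2-d embeddings.
    print(expected_query([0.6, 0.3, 0.1], [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]))
```

In this sketch the first function only redistributes attention within a head, so each row remains a valid distribution. The second keeps the evidence carried by several plausible tokens rather than committing to the single most likely one, which is the intuition behind avoiding the information loss of discrete token prediction.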