Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
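The abstract's paper does not spell out the exact CRoPS formulation, but the general idea of contrastive decoding with several hallucinated models can be sketched as follows: amplify the original model's next-token logits and subtract a weighted mixture of the hallucinated models' logits. The function name, the weighting scheme, and the `alpha` parameter below are illustrative assumptions, not the paper's actual method.

```python
import math

def generalized_contrastive_decoding(orig_logits, hallucinated_logits_list,
                                     weights, alpha=1.0):
    """Illustrative sketch of contrastive decoding with multiple
    'hallucinated' models (formula and names are assumptions):
    contrasted = (1 + alpha) * orig - alpha * sum_i w_i * halluc_i,
    followed by a softmax over the contrasted logits."""
    contrasted = []
    for i, o in enumerate(orig_logits):
        # Weighted mixture of the hallucinated models' logits for token i.
        mix = sum(w * h[i] for w, h in zip(weights, hallucinated_logits_list))
        contrasted.append((1.0 + alpha) * o - alpha * mix)
    # Numerically stable softmax to obtain next-token probabilities.
    m = max(contrasted)
    exps = [math.exp(c - m) for c in contrasted]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a hallucinated model is overconfident in token 0,
# so contrasting shifts probability mass away from it.
probs = generalized_contrastive_decoding(
    orig_logits=[2.0, 1.0, 0.5],
    hallucinated_logits_list=[[2.5, 0.0, 0.0]],
    weights=[1.0],
    alpha=1.0,
)
```

With these toy logits, the original model would pick token 0, but the contrasted distribution penalizes the token the hallucinated model prefers, so the argmax moves to token 1. In practice the hallucinated models would share the LVLM's weights but receive ablated inputs (e.g., removed visual or key text tokens), as the abstract describes.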