Rapid progress in large vision-language models (LVLMs) has yielded unprecedented performance on vision-language tasks. However, because of the strong priors of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs that are inconsistent with the visual content, a failure known as hallucination. To address this, we propose \textbf{Scalpel}, a method that reduces hallucination by shifting attention activation distributions toward more credible regions. During inference, Scalpel predicts a trusted attention direction for each head in the Transformer layers and adjusts the activations accordingly. It uses a Gaussian mixture model to capture the multi-peak distributions of attention activations in the trust and hallucination manifolds, and entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components between the two. During mitigation, Scalpel dynamically adjusts the strength and direction of the intervention based on component membership and on the mapping between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic: it introduces no additional computation and requires only a single decoding step.
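The pipeline described above (mixture components, an entropic transport plan between them, and a membership-weighted shift) can be illustrated with a small NumPy sketch. This is not the Scalpel implementation: the function names, the isotropic shared-variance mixture, the squared-distance cost between component means, and the parameters `eps`, `var`, and `alpha` are all assumptions chosen to keep the example short, and the component means are toy values rather than fitted attention activations.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.5, n_iter=500):
    """Entropic OT plan between component weights a (K,) and b (J,)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)   # match column marginals
        u = a / (kernel @ v)     # match row marginals
    return u[:, None] * kernel * v[None, :]

def responsibilities(x, weights, means, var=1.0):
    """Posterior membership of x under an isotropic GMM (assumed shared variance)."""
    sq_dist = ((x[None, :] - means) ** 2).sum(axis=1)
    log_p = np.log(weights) - 0.5 * sq_dist / var
    log_p -= log_p.max()         # stabilise before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

def steer(x, hall_w, hall_mu, trust_mu, plan, var=1.0, alpha=1.0):
    """Shift x toward the trust components its hallucination components map to."""
    r = responsibilities(x, hall_w, hall_mu, var)
    cond = plan / plan.sum(axis=1, keepdims=True)  # P(trust j | hallucination k)
    targets = cond @ trust_mu                      # barycentric target per component
    shift = r @ (targets - hall_mu)                # membership-weighted direction
    return x + alpha * shift

# Toy 2-D setup (hypothetical values): two components per mixture.
hall_w = np.array([0.5, 0.5])
trust_w = np.array([0.5, 0.5])
hall_mu = np.array([[0.0, 0.0], [10.0, 0.0]])
trust_mu = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance between component means as the transport cost.
cost = ((hall_mu[:, None, :] - trust_mu[None, :, :]) ** 2).sum(axis=-1)
plan = sinkhorn(hall_w, trust_w, cost)

x = np.array([0.2, 0.0])  # an activation near the first hallucination component
steered = steer(x, hall_w, hall_mu, trust_mu, plan)
```

In this toy setting each hallucination component is transported almost entirely to the nearest trust component, so an activation close to the first component is moved by roughly the offset between that pair of means. Setting `alpha` below 1 would weaken the intervention; how the actual method sets strength per head is not specified in the abstract.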