Despite their advanced capabilities, Large Vision-Language Models (LVLMs) frequently suffer from object hallucination. One cause is that visual features become entangled with pretrained textual representations in the deeper network layers, where the visual signal is suppressed. To address this, we propose REVIS, a training-free framework that explicitly re-activates this suppressed visual information. Rooted in latent-space geometry, REVIS extracts a pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach restores visual information at minimal computational cost. Empirical evaluations on standard benchmarks show that REVIS reduces object hallucination rates by approximately 19% relative to state-of-the-art baselines, while preserving general reasoning capabilities.
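The orthogonal-projection step can be sketched numerically. The following is a minimal NumPy illustration of the general idea, not the paper's implementation: the function names, the choice of a two-direction "text subspace," and the additive form of the intervention are all assumptions made for the sake of the example.

```python
import numpy as np

def visual_rejection(h_visual, text_basis):
    """Project h_visual onto the orthogonal complement of the subspace
    spanned by the rows of text_basis, keeping only the component not
    explained by textual directions (illustrative, not REVIS itself)."""
    Q, _ = np.linalg.qr(text_basis.T)       # orthonormal columns spanning the text subspace
    return h_visual - Q @ (Q.T @ h_visual)  # subtract the in-subspace component

def sparse_intervene(hidden, pure_visual, alpha=1.0):
    """Re-inject the rejected visual component at a single layer
    (a hypothetical 'suppression depth'), scaled by alpha."""
    return hidden + alpha * pure_visual

rng = np.random.default_rng(0)
d = 8
text_basis = rng.normal(size=(2, d))  # two illustrative textual directions
h = rng.normal(size=d)                # a stand-in hidden-state vector
pure = visual_rejection(h, text_basis)
# the residual is orthogonal to every textual direction
print(np.allclose(text_basis @ pure, 0.0))  # True
```

Because the projection is applied at only one depth and costs a single matrix-vector product per token, the sketch also makes plain why such an intervention adds little computational overhead.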