We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than on hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving overall performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
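To make the two operations named above concrete, the sketch below shows (i) a logit-lens-style projection of internal image features through the language model's unembedding matrix, yielding per-token output probabilities for a given object word, and (ii) linear orthogonalization of image features against a hallucinated object's feature direction. This is a minimal illustration under stated assumptions, not the paper's exact implementation: the names `vocab_confidence`, `erase_object`, `unembed`, and `image_feats`, the tensor shapes, and the use of a single direction vector for the object are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def vocab_confidence(image_feats: torch.Tensor, unembed: torch.Tensor,
                     token_id: int) -> torch.Tensor:
    """Project internal image features to the language vocabulary and
    return each image token's probability for a given word token.

    image_feats: (num_image_tokens, d) hidden states from the VLM
    unembed:     (vocab_size, d) language-model unembedding matrix
    """
    logits = image_feats @ unembed.T   # (num_image_tokens, vocab_size)
    probs = F.softmax(logits, dim=-1)
    return probs[:, token_id]          # confidence of each image token for this word

def erase_object(image_feats: torch.Tensor, object_dir: torch.Tensor) -> torch.Tensor:
    """Linearly orthogonalize image features with respect to a hallucinated
    object's feature direction, removing that component from every token."""
    v = object_dir / object_dir.norm()             # unit vector for the object direction
    coeff = image_feats @ v                        # (num_image_tokens,) projection lengths
    return image_feats - coeff.unsqueeze(-1) * v   # subtract the component along v
```

Because each image token corresponds to a spatial patch, the per-token confidences from `vocab_confidence` also yield the kind of spatial localization map the abstract describes, which is what enables zero-shot segmentation.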