Large vision-language models (LVLMs) have achieved impressive results across a wide range of vision-language tasks. Despite this promising performance, LVLMs suffer from hallucinations caused by language bias, which diminishes their focus on images and weakens visual comprehension. We identify two primary causes of this bias: (1) the disparity in training-data scale between the LLM pretraining stage and the multimodal alignment stage, and (2) the inference bias learned from the short-term dependencies of text data. We therefore propose LACING, a systematic framework designed to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that strengthens the integration of visual inputs throughout the model. IFG introduces a learnable soft visual prompt, used during both training and inference to replace visual inputs, which compels LVLMs to prioritize text inputs; building on this, IFG further proposes a novel decoding strategy that uses the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).
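To make the IFG decoding idea concrete, the sketch below illustrates one plausible guidance-style combination of two logit vectors: one computed with the real image and one computed with the soft visual prompt in its place, amplifying the shift that the actual visual input induces. The function name, the `gamma` guidance weight, and this exact formula are illustrative assumptions, not the paper's verified implementation.

```python
def ifg_decode_step(logits_with_image, logits_with_soft_prompt, gamma=1.5):
    """Guidance-style logit combination (a sketch, not the official method).

    logits_with_image:       next-token logits conditioned on the real image.
    logits_with_soft_prompt: logits with the learnable soft visual prompt
                             substituted for the image (assumed setup).
    gamma:                   hypothetical guidance strength; gamma=0 recovers
                             the ordinary image-conditioned logits.
    """
    return [
        (1 + gamma) * li - gamma * ls
        for li, ls in zip(logits_with_image, logits_with_soft_prompt)
    ]
```

In this formulation, tokens whose probability rises only when the real image is present are boosted, while tokens favored purely by text-side priors (captured by the soft-prompt branch) are suppressed.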