Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, serving as a crucial bridge between vision and language. However, existing open-source VLMs rely heavily on pretrained, frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate into the VLM's responses, resulting in suboptimal performance. In this work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates the encoder, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Our approach is theoretically grounded, general, and computationally efficient.
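To make the idea of a selective, local encoder update concrete, below is a minimal PyTorch sketch. Everything here is illustrative: `ToyVisionEncoder` stands in for a real CLIP-style encoder, and the selection rule (unfreezing only the small fraction of parameter tensors with the largest gradient magnitude on the failure examples) is an assumed criterion, not necessarily the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


# Hypothetical stand-in for a frozen CLIP-style vision encoder
# (a real setup would load e.g. a pretrained ViT instead).
class ToyVisionEncoder(nn.Module):
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


def select_local_params(model, loss, k=0.1):
    """Score parameter tensors by mean gradient magnitude on the failure
    batch and unfreeze only the top-k fraction; everything else stays
    frozen. (Gradient-magnitude masking is an illustrative assumption.)"""
    loss.backward()
    scores = {n: p.grad.abs().mean().item()
              for n, p in model.named_parameters() if p.grad is not None}
    n_keep = max(1, int(len(scores) * k))
    keep = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    for n, p in model.named_parameters():
        p.requires_grad_(n in keep)
    model.zero_grad()
    return keep


encoder = ToyVisionEncoder()
images = torch.randn(8, 3, 32, 32)       # few-shot "failure" examples
labels = torch.randint(0, 10, (8,))      # corrected supervision

# 1) Decide which parameter tensors to update, using the failure batch.
loss = F.cross_entropy(encoder(images), labels)
updated = select_local_params(encoder, loss, k=0.1)

# 2) Fine-tune only the selected parameters for a few steps; the rest of
#    the encoder stays frozen, preserving behavior on other data.
opt = torch.optim.SGD(
    (p for p in encoder.parameters() if p.requires_grad), lr=1e-2)
for _ in range(5):
    opt.zero_grad()
    F.cross_entropy(encoder(images), labels).backward()
    opt.step()
```

Restricting the optimizer to the selected tensors is what keeps the update local: only the parameters implicated by the failure examples move, which is one way to repair specific mistakes without eroding the encoder's broad robustness.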