Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision: specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language, a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
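The abstract reports shape bias as a percentage without defining it. As a rough illustration, the sketch below assumes the common cue-conflict definition, which is not stated in the text above: each image combines the shape of one class with the texture of another, and shape bias is the fraction of shape decisions among all decisions that match either the shape or the texture class. The function name `shape_bias`, the record layout, and the example labels are hypothetical.

```python
from typing import Iterable, Tuple


def shape_bias(records: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Each record is (predicted_label, shape_label, texture_label).
    ASSUMPTION: this follows the usual cue-conflict protocol; the
    abstract itself does not spell out the metric.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in records:
        if shape_label == texture_label:
            continue  # no cue conflict, so the decision is uninformative
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # predictions matching neither cue are ignored
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no prediction matched a shape or texture label")
    return shape_hits / total


# Hypothetical predictions: three shape decisions, one texture decision,
# and one answer that matches neither cue.
example = [
    ("cat", "cat", "elephant"),
    ("dog", "dog", "clock"),
    ("car", "car", "bottle"),
    ("elephant", "cat", "elephant"),
    ("chair", "bird", "knife"),
]
print(f"shape bias: {shape_bias(example):.0%}")
```

Under this definition a value of 50% means shape and texture decisions are equally frequent, which gives some context for the 49% to 72% steering range and the 96% human figure quoted above.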