Multimodal Large Language Models (MLLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA), a fundamental task that affects many downstream applications and domains. Given the potential for broad use of these models, it is important to investigate their limitations in handling different image and question properties. In this work, we investigate whether MLLMs can perceive small details in images as well as they perceive larger components. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining by up to $45.91\%$ as the subject becomes smaller. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. To scale up the usefulness of human cropping, we propose ViCrop, a general framework that uses automatic visual cropping to enhance the zero-shot VQA of MLLMs. We construct five variants of ViCrop, leveraging either external localization models or the decision process of the given MLLM itself. Our results show that ViCrop improves MLLMs' zero-shot accuracy across different VQA datasets; for example, it improves BLIP2-T5's performance by $32.23\%$ on the TextVQA test set. To facilitate further investigation of MLLMs' behaviors, our code is publicly released.
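The abstract does not say how ViCrop computes its crops, so the following is only a minimal sketch of the general idea of visual cropping, under stated assumptions: some localizer (an external model or the MLLM itself) has already produced a bounding box for the question's visual subject, and the crop keeps a margin of context around it. The names `expand_box` and `crop` and the `scale` parameter are hypothetical and are not taken from the released code. The image is modeled as a nested list of pixel values to keep the example dependency-free.

```python
import math
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def expand_box(box: Box, image_size: Tuple[int, int], scale: float = 1.5) -> Box:
    """Grow `box` about its center by `scale`, then clamp it to the image.

    `image_size` is (width, height). `scale` is an illustrative knob for how
    much surrounding context to keep; it is an assumption of this sketch.
    """
    width, height = image_size
    left, top, right, bottom = box
    center_x, center_y = (left + right) / 2, (top + bottom) / 2
    half_w = (right - left) * scale / 2
    half_h = (bottom - top) * scale / 2
    return (
        max(0, math.floor(center_x - half_w)),
        max(0, math.floor(center_y - half_h)),
        min(width, math.ceil(center_x + half_w)),
        min(height, math.ceil(center_y + half_h)),
    )


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by `box` (rows are indexed by y)."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


# Toy 10x8 image whose pixel value encodes its own (y, x) position.
image = [[y * 10 + x for x in range(10)] for y in range(8)]
subject_box = (4, 3, 6, 5)  # hypothetical localizer output
crop_box = expand_box(subject_box, image_size=(10, 8), scale=2.0)
patch = crop(image, crop_box)
```

In a real pipeline the resulting patch would be handed to the MLLM together with the question; how the crop is combined with the original image is not specified in this abstract.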