Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions; however, little is known about the limits of their perception. While prior work has provided anecdotal evidence of MLLMs' sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation (object quality, size, distractors, and location) and conduct controlled intervention studies to measure the effect of each factor on MLLMs' perception. In particular, we find that lower object quality and smaller object size can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs' question-answering accuracy. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigation, we release our code and data.