Vision-Language Models (VLMs) are expected to reason with commonsense knowledge as humans do. For example, humans can infer where and when an image was taken based on their knowledge. This raises the question of whether, based on visual cues, VLMs pre-trained on large-scale image-text resources can match or even surpass human ability to reason about time and location. To address this question, we propose a two-stage Recognition and Reasoning probing task, applied to discriminative and generative VLMs, to uncover whether VLMs can recognize time- and location-relevant features and further reason about them. To facilitate the investigation, we introduce WikiTiLo, a well-curated image dataset comprising images with rich socio-cultural cues. In extensive experimental studies, we find that although VLMs can effectively retain relevant features in their visual encoders, they still fail to reason perfectly. We will release our dataset and code to facilitate future studies.