Vision-Language Models (VLMs) are expected to reason with commonsense knowledge as humans do. For example, humans can infer where and when an image was taken based on their knowledge. This raises the question of whether VLMs pre-trained on large-scale image-text resources can match or even surpass human ability to reason about time and location from visual cues. To address this question, we propose a two-stage Recognition and Reasoning probing task, applied to discriminative and generative VLMs, to uncover whether VLMs can recognize time- and location-relevant features and further reason about them. To facilitate the investigation, we introduce WikiTiLo, a well-curated image dataset comprising images with rich socio-cultural cues. In extensive experimental studies, we find that although VLMs can effectively retain relevant features in their visual encoders, they still fail to reason perfectly. We will release our dataset and code to facilitate future studies.