Pretrained large vision-language models have drawn considerable interest in recent years due to their remarkable performance. Despite extensive efforts to assess these models from diverse perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To address this gap, we extensively probe GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduce three vision-related tasks, namely caption classification, pairwise captioning, and culture tag selection, to enable a systematic, fine-grained evaluation of visual cultural awareness. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still performs worse in low-resource languages such as Tamil and Swahili. Notably, in human evaluation, GPT-4V's image captions are judged more culturally relevant than the original MaRVL human annotations, suggesting a promising approach to constructing future visual cultural benchmarks.