Pretrained large vision-language models have drawn considerable interest in recent years due to their remarkable performance. Although these models have been assessed from many perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To address this gap, we extensively probe GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduce three vision-related tasks, namely caption classification, pairwise captioning, and culture tag selection, to systematically carry out a fine-grained evaluation of visual cultural awareness. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still performs worse in low-resource languages such as Tamil and Swahili. Notably, in human evaluation, the captions produced by GPT-4V are judged more culturally relevant than the original MaRVL human annotations, suggesting a promising approach for constructing future visual cultural benchmarks.