Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
翻译:近年来,预训练语言模型(PLM)逐渐作为通用接口被使用,这对全面的视觉知识提出了显著需求。然而,当前的PLM及其视觉增强版本(VaLM)在掌握视觉常识知识方面的能力尚不明确。为探究此问题,我们提出了ImageNetVC——一个专门针对1000个ImageNet类别进行零样本视觉常识评估的细粒度人工标注数据集。借助ImageNetVC,我们深入研究了单模态PLM和VaLM的基础视觉常识知识,揭示了缩放定律以及骨干模型对VaLM的影响。此外,我们探究了影响大规模模型视觉常识知识的因素,为开发富含视觉常识知识的语言模型提供了见解。我们的代码和数据集发布于https://github.com/hemingkx/ImageNetVC。