Images often communicate more than they literally depict: a set of tools can suggest an occupation, and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset, comprising 2,000 multiple-choice questions for evaluating the cognitive reasoning abilities of multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.