Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can be understood only within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: is today's AI capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises (with corresponding regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that 1) machines struggle to capture visual cues: GPT-4o achieved 78.5% accuracy while humans reached 98.0%, and models performed 19.5% worse when distinguishing irrelevant objects within the image than when distinguishing objects external to it; and 2) providing the relevant visual premises significantly improved model performance.
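The annotation structure described above (visual premises with image regions, commonsense premises, and a reasoning tree connecting them to a conclusion) can be sketched as a minimal record. This is an illustrative assumption about the shape of such data, not the released VisArgs schema; all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VisualPremise:
    """A premise grounded in a specific image region (hypothetical layout)."""
    text: str                           # e.g. "a melting glacier"
    region: Tuple[int, int, int, int]   # bounding box (x, y, w, h) in pixels

@dataclass
class VisualArgument:
    """One annotated image: premises plus a reasoning tree (assumed shape)."""
    image_id: str
    visual_premises: List[VisualPremise]
    commonsense_premises: List[str]
    # Reasoning tree encoded as edges: (supporting premise texts) -> derived claim
    reasoning_edges: List[Tuple[Tuple[str, ...], str]] = field(default_factory=list)
    conclusion: str = ""

# Toy example of a single record
arg = VisualArgument(
    image_id="img_0001",
    visual_premises=[VisualPremise("a melting glacier", (10, 20, 300, 200))],
    commonsense_premises=["glaciers melt as the climate warms"],
    reasoning_edges=[
        (("a melting glacier", "glaciers melt as the climate warms"),
         "the climate is warming"),
    ],
    conclusion="viewers should act on climate change",
)
```

Under this sketch, premise localization maps each `VisualPremise.text` to its `region`, premise identification selects which premises belong to the argument, and conclusion deduction produces `conclusion` from the premises and reasoning edges.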