Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they "fool" four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs' multimodal reasoning abilities. We make our code and TRADE available at https://github.com/dmg-illc/trade .