While visual question-answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also requires lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step, taxonomy-driven methodology for instantiating task examples. We then develop COLUMBUS, a synthetic benchmark that applies this task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While state-of-the-art (SotA) vision-language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.