In this article, we investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem solving, and other Math AI tasks, and several formalisms have been proposed for the abstractions and skills that humans and intelligent systems draw on when reasoning. Because human reasoning is inherently multimodal, we focus our investigation on multimodal AI. Specifically, we adopt the abstractions of the SMART task (Simple Multimodal Algorithmic Reasoning Task) introduced in \cite{cherian2022deep}, which frames meta-reasoning and problem-solving skills along eight axes: math, counting, path, measure, logic, spatial, algebra, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Composing representations with vision-language cross-attention lets the model learn multimodal representations adaptively from fused, frozen pretrained backbones, yielding better visual grounding. Furthermore, careful hyperparameter and other training choices led to strong improvements (up to a $48\%$ gain in accuracy) on the SMART task, further underscoring the power of deep multimodal learning. The smartest VLM, which includes a novel QF multimodal layer, improves upon the best previous baselines in every one of the eight fundamental reasoning skills. End-to-end code is available at https://github.com/smarter-vlm/smarter.
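To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-attention layer that adaptively combines features from frozen pretrained vision and language backbones. The class name `QFLayer`, the dimensions, and the stand-in backbones are illustrative assumptions, not the paper's exact implementation; the actual architecture is in the linked repository.

```python
# A minimal sketch (not the paper's exact implementation) of vision-language
# cross-attention fusion over frozen backbones. All names and dimensions here
# (QFLayer, d_model=768, the stand-in backbones) are assumptions.
import torch
import torch.nn as nn

class QFLayer(nn.Module):
    """Hypothetical fusion layer: question (text) tokens attend to frozen
    image-patch features via cross-attention, then pass through an FFN."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text_feats, image_feats):
        # Queries come from the language stream; keys/values from vision,
        # so the fused representation stays grounded in the image.
        attn_out, _ = self.cross_attn(text_feats, image_feats, image_feats)
        x = self.norm1(text_feats + attn_out)
        return self.norm2(x + self.ffn(x))

# Stand-ins for frozen pretrained backbones; only the fusion layer
# (and a downstream answer head, omitted here) would be trained.
vision_backbone = nn.Linear(768, 768)
text_backbone = nn.Linear(768, 768)
for p in list(vision_backbone.parameters()) + list(text_backbone.parameters()):
    p.requires_grad = False

fusion = QFLayer()
img = torch.randn(2, 196, 768)  # e.g. 14x14 ViT patch embeddings
txt = torch.randn(2, 32, 768)   # e.g. question-token embeddings
fused = fusion(text_backbone(txt), vision_backbone(img))
print(fused.shape)  # torch.Size([2, 32, 768])
```

The design intent in such a setup is that only the fusion parameters receive gradients, keeping training inexpensive while still letting text tokens attend to image regions for grounding.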