Although recent large multimodal models (LMMs) have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens the reasoning of LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V) on complex multimodal tasks. To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that lets LMMs focus on salient regions. We further design an image-abstraction prompt that extracts key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) demonstrate the effectiveness of our method.
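
To make the three stages concrete, the following is a minimal sketch of how an understand, abstract, and check loop could be wired together. It is not the authors' implementation: the `ask` callable, the function name `unac_answer`, the `max_retries` parameter, and all prompt wording are hypothetical, and the "understanding" step here is only a text-level stand-in for the adaptive visual prompting described above.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption, not a real API):
# takes an image reference and a text prompt, returns the model's text reply.
AskFn = Callable[[str, str], str]


def unac_answer(
    ask: AskFn, image: str, question: str, max_retries: int = 1
) -> Tuple[str, List[Tuple[str, str]]]:
    """Illustrative understand -> abstract -> check pipeline."""
    # Understanding: ask which regions matter for this question
    # (text-only stand-in for adaptive visual prompting).
    regions = ask(image, f"List the image regions needed to answer: {question}")

    # Abstracting: extract the key facts from those regions.
    facts = ask(image, f"Focusing on {regions}, extract the key facts for: {question}")

    # Decompose the question into subquestions, one per line.
    raw = ask(image, f"Facts: {facts}\nSplit into subquestions, one per line: {question}")
    subquestions = [line.strip() for line in raw.splitlines() if line.strip()]

    # Checking: answer each subquestion, then verify it before moving on.
    verified: List[Tuple[str, str]] = []
    for sq in subquestions:
        answer = ask(image, f"Facts: {facts}\nAnswer: {sq}")
        for _ in range(max_retries):
            verdict = ask(
                image,
                f"Facts: {facts}\nQ: {sq}\nA: {answer}\nIs this correct? Reply yes or no.",
            )
            if verdict.strip().lower().startswith("yes"):
                break
            # Rejected: ask for a revised answer to the same subquestion.
            answer = ask(
                image,
                f"Facts: {facts}\nThe answer '{answer}' was rejected. Answer again: {sq}",
            )
        verified.append((sq, answer))

    # Compose the final answer from the checked intermediate steps.
    steps = "\n".join(f"{q} -> {a}" for q, a in verified)
    final = ask(image, f"Verified steps:\n{steps}\nFinal answer to: {question}")
    return final, verified
```

Passing the model as a plain callable keeps the sketch independent of any particular LMM client; a real system would wrap its own API call in `ask` and would likely apply visual marks to the image rather than describing regions in text.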