Large Multimodal Models (LMMs) have achieved impressive success in visual understanding and reasoning, markedly improving mathematical reasoning in visual contexts. Yet one challenging type of visual math remains: the multimodal graph theory problem, which demands that LMMs accurately understand graphical structures and perform multi-step reasoning on the visual graph. Additionally, exploring multimodal graph theory problems will lead to more effective strategies in fields such as biology, transportation, and robotics planning. To step forward in this direction, we are the first to design a benchmark, named VisionGraph, for exploring the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight complex graph problem tasks, ranging from connectivity to shortest path problems. We then present a Description-Program-Reasoning (DPR) chain that enhances the logical accuracy of the reasoning process through graphical structure description generation and algorithm-aware multi-step reasoning. Our extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning; 2) all LMMs exhibit poor perception accuracy for graphical structures, whether in zero-shot/few-shot settings or with supervised fine-tuning (SFT), which in turn degrades problem-solving performance; and 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs, with the GPT-4V (DPR) agent achieving state-of-the-art (SOTA) performance.
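To make the two task types named above concrete, the following is a minimal illustrative sketch of what solving connectivity and shortest path looks like once a graph's structure has been transcribed into text. It is not the authors' DPR implementation, whose details are not given in this abstract. The edge-list format, the function names (`build_adjacency`, `is_connected`, `shortest_path_length`), and the example graph are all assumptions made for illustration, and the algorithms are standard breadth-first search and Dijkstra's algorithm.

```python
from collections import deque
import heapq


def build_adjacency(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples.

    The triple format is a hypothetical stand-in for a textual graph
    description; it is not a format defined by the benchmark.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def is_connected(adj, source, target):
    """Breadth-first search: report whether any path links source and target."""
    if source == target:
        return True
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor, _ in adj.get(node, []):
            if neighbor == target:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def shortest_path_length(adj, source, target):
    """Dijkstra's algorithm for non-negative weights.

    Returns the total weight of a shortest path, or None if target
    cannot be reached from source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for neighbor, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None


# Hypothetical graph, as it might be transcribed from an image.
edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (4, 5, 1)]
adj = build_adjacency(edges)
```

With this example graph, `is_connected(adj, 0, 3)` holds while `is_connected(adj, 0, 4)` does not, since nodes 4 and 5 form a separate component. The sketch also shows why perception accuracy matters for problem solving: a single mis-transcribed edge in `edges` changes the answer even when the algorithm itself is correct.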