Open-source multimodal large language models (MLLMs) excel at various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning on intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline, VCAR, which emphasizes Visual Comprehension training in Addition to mathematical Reasoning learning. VCAR first improves the visual comprehension ability of MLLMs through a visual description generation task, followed by a second training step that learns to generate rationales with the assistance of these descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods that rely solely on rationale supervision, especially on problems with high visual demands.
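The two-step pipeline can be illustrated with a minimal sketch. All names below (`Sample`, `ToyMLLM`, `train_step`, `vcar_train`) are hypothetical placeholders for exposition, not identifiers from the VCAR implementation; the key idea shown is the ordering of supervision: description targets first, then rationale targets conditioned on the descriptions.

```python
# Hypothetical sketch of a two-step training pipeline in the spirit of VCAR.
# Stage 1 supervises visual description generation; stage 2 supervises
# rationale generation with the description included in the input.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image: str        # placeholder for image features
    question: str
    description: str  # visual description (stage-1 target)
    rationale: str    # reasoning chain (stage-2 target)

@dataclass
class ToyMLLM:
    """Stand-in for an MLLM; records which (input, target) pairs it saw."""
    seen: list = field(default_factory=list)

    def train_step(self, inputs: str, target: str) -> None:
        # In a real setup this would be a gradient update on the
        # next-token loss; here we just log the supervision signal.
        self.seen.append((inputs, target))

def vcar_train(model: ToyMLLM, data: list[Sample]) -> ToyMLLM:
    # Step 1: visual comprehension -- learn to describe the figure.
    for s in data:
        model.train_step(f"{s.image} {s.question}", s.description)
    # Step 2: mathematical reasoning -- generate the rationale,
    # conditioned on the description produced in step 1.
    for s in data:
        model.train_step(f"{s.image} {s.question} {s.description}", s.rationale)
    return model
```

In this sketch the description acts as an intermediate textual bridge: at inference time the model would first emit a description of the figure and then condition its rationale on it.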