Multimodal Small-to-Medium-sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information, but they still face significant limitations in visual comprehension and mathematical reasoning, particularly on geometric problems with varying degrees of visual information. Current models struggle to accurately decompose intricate visual inputs and to connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, enabling structured reasoning informed by visual comprehension. To support this, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving up to a 10-percentage-point improvement over supervised fine-tuning with data augmentation in vision-intensive settings. A robustness analysis shows that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.