Multimodal large language models (MLLMs) have demonstrated significant progress in semantic scene understanding and text-image alignment, and reasoning-tuned variants further improve performance on complex tasks involving mathematics and logic. However, their fine-grained visual understanding and spatial reasoning abilities remain insufficiently evaluated. To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities. ReasonMap comprises high-resolution transit maps from 30 cities and includes 1,008 question-answer pairs spanning two question types and three templates. We further design a two-level evaluation pipeline that assesses both answer correctness and answer quality. A comprehensive evaluation of 16 popular MLLMs reveals a counterintuitive pattern: among open-source models, base variants outperform their reasoning-tuned counterparts, whereas the opposite trend holds for closed-source models. Further analysis under a visual-masking setting confirms that strong performance requires direct visual grounding rather than reliance on language priors alone. Finally, we establish a training baseline with reinforcement fine-tuning, providing a reference for future exploration. We hope this benchmark study offers new insights into visual reasoning and helps investigate the gap between open- and closed-source models.
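To make the two-level evaluation concrete, the sketch below scores a predicted transit route first for correctness (does it match a valid reference route?) and then for quality (fewer transfers score higher). This is a minimal illustration under assumed data structures; the function name, fields, and scoring rule are hypothetical and not the paper's actual implementation.

```python
def evaluate_route(predicted, reference_routes):
    """Hypothetical two-level scorer for a transit-map route answer.

    Level 1 (correctness): the predicted station sequence must exactly
    match one of the valid reference routes.
    Level 2 (quality): among correct answers, routes with fewer transfers
    receive a higher score in (0, 1].
    """
    # Level 1: exact match against any valid reference route.
    correct = any(predicted["stations"] == ref["stations"]
                  for ref in reference_routes)
    if not correct:
        return {"correct": False, "quality": 0.0}

    # Level 2: normalize against the minimum transfer count among
    # reference routes (+1 avoids division by zero for direct routes).
    best_transfers = min(ref["transfers"] for ref in reference_routes)
    quality = (best_transfers + 1) / (predicted["transfers"] + 1)
    return {"correct": True, "quality": quality}
```

An answer that follows a valid route with the minimum number of transfers would thus score `{"correct": True, "quality": 1.0}`, while a valid but transfer-heavy route is marked correct with a lower quality score.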