Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Reasoning in vision-language models (VLMs) has recently attracted significant attention because it applies to a broad range of downstream tasks. However, it remains unclear whether the strong performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To measure this systematically, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats that carry identical task-relevant information, as verified by human annotators. This alignment isolates modality-specific reasoning differences and eliminates confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas adding visual data (image+text) frequently degrades performance relative to the text-only baseline. These findings indicate that current VLMs reason primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, and also yields robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.
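The controlled comparison described above can be illustrated with a minimal sketch: given per-problem correctness for each input format, compute accuracy per modality and the gap relative to the text-only baseline. The record layout, the function names accuracy_by_modality and modality_gap, and the toy data are all assumptions made for illustration; they are not taken from the CrossMath code base and do not reflect any reported result.

```python
from collections import defaultdict

# Hypothetical modality labels for the three formats named in the abstract.
MODALITIES = ("text", "image", "image+text")


def accuracy_by_modality(records):
    """Return accuracy per modality.

    records: iterable of (problem_id, modality, is_correct) tuples
    (an assumed layout, not the benchmark's actual schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, modality, is_correct in records:
        total[modality] += 1
        correct[modality] += int(is_correct)
    return {m: correct[m] / total[m] for m in MODALITIES if total[m]}


def modality_gap(acc, baseline="text"):
    """Return baseline accuracy minus accuracy for every other modality."""
    return {m: acc[baseline] - a for m, a in acc.items() if m != baseline}


# Toy data only: two made-up problems, each answered in all three formats.
records = [
    ("p1", "text", True), ("p1", "image", False), ("p1", "image+text", True),
    ("p2", "text", True), ("p2", "image", False), ("p2", "image+text", False),
]

acc = accuracy_by_modality(records)
gap = modality_gap(acc)
print(acc)
print(gap)
```

Because every problem appears in all three formats with the same task-relevant information, a positive gap in this sketch can be attributed to the input modality rather than to differences in problem content.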
