Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios. However, existing approaches often treat visual inputs as homogeneous or auxiliary signals and fail to capture the intricate, sample-specific dependencies between text and images in mathematical problem solving. This gives rise to two core issues. First, supervisory signals for visual content are coarse-grained and are not adapted to how much each sample actually requires visual information. Second, training feedback becomes inaccurate when visual rewards are applied uniformly, without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, which augments fine-grained visual annotations with visual dependency ratings. Building on this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer-correctness rewards and visual-grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework progressively enhances visual perception according to each sample's visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.
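To make the idea of dependency-aware reward balancing concrete, the following is a minimal sketch and not the formulation used in this work. It assumes a visual dependency rating normalized to [0, 1] and a simple additive form in which the visual-grounding reward is scaled by that rating. The function name `blended_reward`, the parameter `max_visual_weight`, and the additive form itself are hypothetical and chosen only for illustration.

```python
def blended_reward(
    answer_reward: float,
    grounding_reward: float,
    visual_dependency: float,
    max_visual_weight: float = 0.5,
) -> float:
    """Combine answer and grounding rewards for one sample.

    Assumption (not from the paper): visual_dependency is a rating
    normalized to [0, 1], and the grounding reward is weighted linearly
    by it, capped at max_visual_weight.
    """
    if not 0.0 <= visual_dependency <= 1.0:
        raise ValueError("visual_dependency must lie in [0, 1]")
    # Samples that barely need the image receive almost no grounding
    # reward; highly image-dependent samples receive the full weight.
    visual_weight = max_visual_weight * visual_dependency
    return answer_reward + visual_weight * grounding_reward


# Hypothetical samples: one solvable from text alone, one image-dependent.
text_only = blended_reward(1.0, 0.5, visual_dependency=0.0)
image_heavy = blended_reward(1.0, 0.5, visual_dependency=1.0)
```

Under these assumptions, a sample that can be solved from text alone is scored purely on answer correctness, so a uniformly applied visual reward no longer biases its feedback.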