Multimodal Mathematical Reasoning (MMR), which targets mathematical problems involving both textual and visual modalities, has recently attracted increasing attention. However, current models still face significant challenges in real-world visual math tasks: they often misinterpret diagrams, fail to align mathematical symbols with visual evidence, or produce inconsistent reasoning steps. Moreover, existing evaluations mainly check final answers rather than verifying the correctness or executability of each intermediate step. A growing body of recent research addresses these issues by integrating structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing MMR approaches, we systematically review them around four fundamental questions: (1) what to extract from multimodal inputs, (2) how to represent and align textual and visual information, (3) how to perform the reasoning, and (4) how to evaluate the correctness of the overall reasoning process. Finally, we discuss open challenges and outline directions for future research.