MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li

from arxiv, Project Page: https://mathcanvas.github.io/

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

翻译：尽管大型语言模型（LLMs）在文本推理方面表现出色，但在几何等本质上依赖视觉辅助的数学领域仍存在困难。现有的视觉思维链（VCoT）方法常受限于僵化的外部工具，或无法生成复杂问题求解所需的高保真、策略性定时的图表。为弥合这一差距，我们提出了MathCanvas——一个旨在为统一的大型多模态模型（LMMs）赋予内在数学VCoT能力的综合框架。我们的方法包含两个阶段。首先，在视觉操控阶段，模型在一个新颖的1520万对语料库上进行预训练，该语料库包含1000万图文对（MathCanvas-Imagen）和520万逐步编辑轨迹（MathCanvas-Edit），以掌握图表生成与编辑。其次，在策略性视觉辅助推理阶段，模型在MathCanvas-Instruct上进行微调，这是一个包含21.9万个交织式视觉-文本推理路径的新数据集，旨在教会模型何时及如何利用视觉辅助。为促进严谨评估，我们推出了MathCanvas-Bench，这是一个包含3000个需要模型生成交织式视觉-文本解决方案的挑战性基准测试。在此框架下训练的模型BAGEL-Canvas，在MathCanvas-Bench上相较于强大的LMM基线实现了86%的相对性能提升，并在其他公共数学基准测试中展现出优异的泛化能力。我们的工作提供了一套完整的工具包——包括框架、数据集和基准测试——以解锁LMM中复杂、类人的视觉辅助推理能力。项目页面：https://mathcanvas.github.io/

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日