CodePercept: Code-Grounded Visual STEM Perception for MLLMs


Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

From arXiv; accepted by CVPR 2026.

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the main bottleneck limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work that relies on problem-solving accuracy as a proxy, which measures only problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
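As a rough illustration of the code-as-ground-truth idea described above, the sketch below builds one Image-Caption-Code triplet from a single figure specification, so the caption can only state facts that also appear in the plotting code. This is not the authors' pipeline: `TriangleSpec`, `render_code`, `render_caption`, `make_triplet`, and the file name `triangle.png` are hypothetical names chosen for this example, and the matplotlib source is only generated as text, not executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangleSpec:
    """Hypothetical figure spec: a labelled triangle given by three vertices."""
    vertices: tuple  # ((x, y), (x, y), (x, y))
    labels: tuple    # one label per vertex, e.g. ("A", "B", "C")


def render_code(spec: TriangleSpec) -> str:
    """Emit matplotlib source that would draw the figure (not executed here)."""
    # Repeat the first vertex so the outline closes.
    xs = [p[0] for p in spec.vertices] + [spec.vertices[0][0]]
    ys = [p[1] for p in spec.vertices] + [spec.vertices[0][1]]
    lines = [
        "import matplotlib.pyplot as plt",
        "fig, ax = plt.subplots()",
        f"ax.plot({xs}, {ys}, color='black')",
    ]
    for (x, y), name in zip(spec.vertices, spec.labels):
        lines.append(f"ax.text({x}, {y}, {name!r})")
    lines.append("ax.set_aspect('equal')")
    lines.append("fig.savefig('triangle.png')")
    return "\n".join(lines)


def render_caption(spec: TriangleSpec) -> str:
    """Derive the caption from the same spec, so it cannot contradict the code."""
    parts = [
        f"{name} at ({x}, {y})"
        for (x, y), name in zip(spec.vertices, spec.labels)
    ]
    return "A triangle with vertices " + ", ".join(parts) + "."


def make_triplet(spec: TriangleSpec) -> dict:
    """Bundle image path, caption, and code into one illustrative triplet."""
    return {
        "image": "triangle.png",  # produced only if the code string is executed
        "caption": render_caption(spec),
        "code": render_code(spec),
    }


spec = TriangleSpec(vertices=((0, 0), (4, 0), (0, 3)), labels=("A", "B", "C"))
triplet = make_triplet(spec)
print(triplet["caption"])
print(triplet["code"])
```

Because both the caption and the code are derived from one specification, every coordinate and label in the caption can be checked against the code, which is the property the abstract attributes to code-grounded captions.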
