We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation. Specifically, our model improves the reasoning performance of three MLLM families at four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct VisualPRM400K, a multimodal process supervision dataset built with an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the ability of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire future research and contribute to the development of MLLMs. Our model, data, and benchmark are released at https://internvl.github.io/blog/2025-03-13-VisualPRM/.
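To make the BoN evaluation strategy concrete, the sketch below shows how a process reward model can rank N candidate responses by scoring each reasoning step and selecting the highest-scoring chain. The `score_steps` callable is a hypothetical stand-in for the PRM, not the actual VisualPRM API; aggregating per-step scores by their mean is one common choice, shown here for illustration only.

```python
# Minimal sketch of Best-of-N (BoN) selection with a Process Reward Model (PRM).
# `score_steps` is a hypothetical placeholder for the PRM scoring interface.
from typing import Callable, List


def best_of_n(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate response whose mean per-step PRM score is highest.

    candidates: N candidate responses, each a list of reasoning steps.
    score_steps: maps a list of steps to per-step correctness scores in [0, 1].
    """
    def response_score(steps: List[str]) -> float:
        scores = score_steps(steps)
        return sum(scores) / len(scores) if scores else 0.0

    return max(candidates, key=response_score)


# Toy usage with a stub PRM that (for illustration) prefers shorter chains.
stub_prm = lambda steps: [1.0 / len(steps)] * len(steps)
best = best_of_n([["step a", "step b", "step c"], ["step x", "step y"]], stub_prm)
```

In practice the PRM would score each step of an MLLM's sampled responses, and the aggregation rule (mean, min, or product of step scores) is a design choice left open here.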