We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation. Specifically, our model improves the reasoning performance of three MLLM families at four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct VisualPRM400K, a multimodal process supervision dataset built with an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the ability of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire future research and contribute to the development of MLLMs. Our model, data, and benchmark are released at https://internvl.github.io/blog/2025-03-13-VisualPRM/.
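To make the BoN evaluation strategy concrete, the sketch below shows how a process reward model can rank N candidate responses by scoring each reasoning step and selecting the highest-scoring chain. The `score_steps` callable is a hypothetical stand-in for the PRM, not the actual VisualPRM API; aggregating per-step scores by their mean is one common choice, shown here for illustration only.

```python
# Minimal sketch of Best-of-N (BoN) selection with a Process Reward Model (PRM).
# `score_steps` is a hypothetical placeholder for the PRM scoring interface.
from typing import Callable, List


def best_of_n(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate response whose mean per-step PRM score is highest.

    candidates: N candidate responses, each a list of reasoning steps.
    score_steps: maps a list of steps to per-step correctness scores in [0, 1].
    """
    def response_score(steps: List[str]) -> float:
        scores = score_steps(steps)
        return sum(scores) / len(scores) if scores else 0.0

    return max(candidates, key=response_score)


# Toy usage with a stub PRM that (for illustration) prefers shorter chains.
stub_prm = lambda steps: [1.0 / len(steps)] * len(steps)
best = best_of_n([["step a", "step b", "step c"], ["step x", "step y"]], stub_prm)
```

In practice the PRM would score each step of an MLLM's sampled responses, and the aggregation rule (mean, min, or product of step scores) is a design choice left open here.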