Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, RLVR's outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To address this, we propose Perceval, a process reward model (PRM) that enables token-level error grounding. Perceval extracts image-related claims from a response, compares them one by one with the visual evidence in the image, and returns the claims that contain perceptual errors. Perceval is trained on perception-intensive supervised data. We then integrate Perceval into the RL training process to train policy models. Specifically, whereas traditional GRPO applies sequence-level advantages, we apply token-level advantages by targeting penalties at the hallucinated spans identified by Perceval, which provides fine-grained supervision signals. Beyond augmenting training, Perceval can also assist VLMs at inference time. Using Perceval, we truncate the erroneous portion of the model's response and then either have the model regenerate the response directly or prompt it to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, Perceval also yields consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
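The contrast between sequence-level and token-level advantages can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not give the exact penalty rule, so the function names, the fixed `penalty` value, and the half-open `(start, end)` span format are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantages: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(seq_advantage, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast a sequence advantage to every token, then lower it on flagged spans.

    flagged_spans holds half-open (start, end) token ranges that a verifier
    marked as hallucinated (hypothetical format). Overlapping spans are
    penalized only once.
    """
    adv = [seq_advantage] * num_tokens
    for start, end in flagged_spans:
        for t in range(max(start, 0), min(end, num_tokens)):
            adv[t] = seq_advantage - penalty
    return adv
```

For example, `token_advantages(0.5, 5, [(1, 3)])` keeps the advantage at 0.5 for unflagged tokens and lowers it to -0.5 on tokens 1 and 2, so only the flagged span is pushed down.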
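The inference-time procedure (truncate at the detected error, regenerate, repeat) can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate` and `first_error_index` are hypothetical callables standing in for the policy VLM and for Perceval, and the character-index truncation is a simplification. The reflection variant mentioned in the abstract is not shown.

```python
from typing import Callable, Optional


def refine_with_truncation(
    generate: Callable[[str], str],
    first_error_index: Callable[[str], Optional[int]],
    max_rounds: int = 3,
) -> str:
    """Repeatedly cut a response at its first flagged error and regenerate the rest.

    generate(prefix) returns a continuation of the kept prefix (hypothetical model call).
    first_error_index(response) returns the character offset where the first
    perceptual error starts, or None if no error is found (hypothetical verifier call).
    """
    response = generate("")
    for _ in range(max_rounds):
        cut = first_error_index(response)
        if cut is None:
            break  # nothing flagged, so keep the response as is
        prefix = response[:cut]  # keep only the part before the error
        response = prefix + generate(prefix)
    return response
```

Raising `max_rounds` spends more inference compute per query, which is the sense in which the loop supports test-time scaling.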