Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies in which the model reasons alone. This lack of intermediate oversight leaves the reasoning process susceptible to error propagation: early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the \textbf{Guided Verifier} framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, the verifier interacts with the policy model in real time, detecting inconsistencies and providing directional signals that steer the model toward valid trajectories. To support this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the \textbf{CoRe} dataset of process-level negatives and \textbf{Co}rrect-guide \textbf{Re}asoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse, and MMMU indicate that, by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
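For concreteness, the following is a minimal sketch of the verifier-in-the-loop rollout described above. All identifiers (\texttt{Step}, \texttt{Verdict}, \texttt{guided\_rollout}, \texttt{policy\_step}, \texttt{verify}, \texttt{hint}) are hypothetical stand-ins for illustration, not the framework's actual interfaces.

\begin{verbatim}
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sketch: all names below are illustrative stand-ins,
# not the paper's actual API.

@dataclass
class Step:
    text: str
    is_terminal: bool = False

@dataclass
class Verdict:
    consistent: bool
    hint: Optional[str] = None  # directional signal toward a valid trajectory

def guided_rollout(
    policy_step: Callable[[List[Step], Optional[str]], Step],
    verify: Callable[[List[Step], Step], Verdict],
    max_steps: int = 16,
) -> List[Step]:
    """Roll out step by step; the verifier co-checks each step in real time."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        step = policy_step(trajectory, None)
        verdict = verify(trajectory, step)
        if not verdict.consistent:
            # Regenerate the flagged step under the verifier's hint, rather
            # than letting the logical deviation cascade through the rollout.
            step = policy_step(trajectory, verdict.hint)
        trajectory.append(step)
        if step.is_terminal:
            break
    return trajectory
\end{verbatim}

The key design choice this sketch illustrates is intervention timing: the verifier acts during generation, step by step, rather than scoring the finished trajectory, which is what distinguishes it from a passive terminal reward.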