OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advances in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By forcing the advantage distribution of any given task to converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerability to heavy-tailed outliers, and provides symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, preventing both entropy collapse and entropy explosion. Integrating these methods, we present OpenVLThinkerV2, a robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
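
To make the contrast between linear scaling and distributional matching concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how G$^2$RPO performs the matching, so this example assumes one common technique, a rank-based inverse normal transform, and the function names `grpo_advantages` and `gaussian_rank_advantages` are hypothetical. The first function shows the usual GRPO-style group normalization (subtract the group mean, divide by the group standard deviation); the second maps each reward's rank to the corresponding quantile of $\mathcal{N}(0,1)$.

```python
from statistics import NormalDist, mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Linear scaling as in standard GRPO: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_rank_advantages(rewards):
    """Assumed stand-in for non-linear distributional matching.

    Each reward is replaced by the N(0, 1) quantile of its (tie-averaged)
    rank within the group, so the output depends only on reward ordering.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    group = [0.0, 0.1, 0.2, 100.0]  # one heavy-tailed outlier
    print("linear scaling:", grpo_advantages(group))
    print("rank-to-normal:", gaussian_rank_advantages(group))
```

Under this assumed transform, a single extreme reward cannot dominate the group: the advantages depend only on the ordering of rewards, are symmetric around zero, and collapse to zero when every response in the group receives the same reward. With linear scaling, by contrast, the outlier in the example group stretches the standard deviation and pushes the other three responses to nearly identical negative advantages.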
