Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical \textit{perception-reasoning decoupling}. Existing paradigms, which are driven by text-centric outcome rewards and carry out their reasoning in the language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain, or surprisingly even improve, their performance when visual inputs are removed entirely. This reveals that these models degenerate into \textit{blind reasoners} that exploit linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy (DVRP)}. DVRP introduces intrinsic supervision via visual triplets, each comprising an original, a masked, and a perturbed input. It optimizes the model to maximize the divergence of its reasoning from that on masked inputs (enforcing \textit{visual sensitivity}) while minimizing the divergence from that on perturbed inputs (ensuring \textit{visual robustness}). By aligning reasoning variations strictly with the \textit{Delta} of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
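The triplet objective described above can be illustrated with a minimal sketch. The abstract does not specify the divergence measure, the weighting, or how the term enters the RL objective, so everything here is an assumption made for illustration: KL divergence over a toy answer distribution, the hypothetical weights `alpha` and `beta`, and the hypothetical function names `kl_divergence` and `differential_objective`. It is not the authors' implementation.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize defensively
    q = q / q.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def differential_objective(p_orig, p_masked, p_perturbed, alpha=1.0, beta=1.0):
    """Toy intrinsic signal (assumed form, not taken from the paper).

    Rewards divergence from the masked-input output (visual sensitivity)
    and penalizes divergence from the perturbed-input output (visual
    robustness). alpha and beta are hypothetical weights.
    """
    sensitivity = kl_divergence(p_orig, p_masked)
    robustness_gap = kl_divergence(p_orig, p_perturbed)
    return alpha * sensitivity - beta * robustness_gap


# Toy answer distributions over four options.
p_orig = [0.70, 0.10, 0.10, 0.10]       # original image
p_perturbed = [0.68, 0.12, 0.10, 0.10]  # lightly perturbed image

# A visually grounded policy becomes uncertain once the image is masked.
grounded_masked = [0.25, 0.25, 0.25, 0.25]
# A "blind reasoner" answers almost identically without the image.
blind_masked = [0.69, 0.11, 0.10, 0.10]

grounded_score = differential_objective(p_orig, grounded_masked, p_perturbed)
blind_score = differential_objective(p_orig, blind_masked, p_perturbed)
print(f"grounded: {grounded_score:.4f}, blind: {blind_score:.4f}")
```

Under these assumptions the grounded policy scores higher than the blind one, because its output changes when visual evidence is removed but stays stable under a benign perturbation.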
