Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
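The core idea of optimizing an unsupervised intermediate focus decision from outcome rewards alone can be illustrated with a minimal REINFORCE-style sketch. This is a toy illustration, not VisRL's actual implementation: the candidate regions, reward function, and policy parameterization below are all hypothetical stand-ins. A policy samples a focus region, only the correctness of the final answer produces a reward, and no bounding-box annotation is ever used as supervision.

```python
# Toy sketch (assumptions, not VisRL's API): a categorical policy over four
# candidate focus regions is trained with REINFORCE. The reward depends only
# on the final answer's correctness; the chosen region is never supervised.
import math
import random

random.seed(0)

REGIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]
TARGET = "bottom-left"  # region containing the evidence (unknown to the policy)

logits = [0.0] * len(REGIONS)  # trainable policy parameters


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def reward(region_idx):
    # Outcome-only signal: 1 if focusing here lets the model answer correctly.
    return 1.0 if REGIONS[region_idx] == TARGET else 0.0


lr, baseline = 0.5, 0.0
for step in range(300):
    probs = softmax(logits)
    idx = sample(probs)          # trial: pick a focus region
    r = reward(idx)              # error signal from the final answer only
    baseline = 0.9 * baseline + 0.1 * r  # running baseline reduces variance
    # REINFORCE update: raise log-prob of the sampled region if r beats baseline
    for i in range(len(logits)):
        grad = (1.0 if i == idx else 0.0) - probs[i]
        logits[i] += lr * (r - baseline) * grad

best = REGIONS[max(range(len(REGIONS)), key=lambda i: logits[i])]
print(best)
```

After a few hundred trial-and-error steps the policy concentrates its probability mass on the region that yields correct answers, which is the sense in which intermediate focus selection can be learned purely from reward.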