VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial and error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
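To make the idea of learning focus selection from reward alone more concrete, here is a minimal toy sketch. It is not the VisRL algorithm and is not taken from the linked repository. It assumes a hypothetical setting with a handful of discrete candidate regions, a softmax policy over them, and a simulated answerer that is more often correct when the policy focuses on one "informative" region. The function name `train_focus_policy`, the probabilities 0.9 and 0.2, and the REINFORCE-style update with a running baseline are all illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_focus_policy(num_regions=4, informative_region=2,
                       steps=3000, lr=0.1, seed=0):
    """Toy REINFORCE loop: learn which region to focus on using only
    a scalar reward for the final answer (no region annotations).

    Everything here is a hypothetical stand-in for a real LMM pipeline.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_regions   # policy parameters, one per region
    baseline = 0.0                 # running average reward (variance reduction)

    for _ in range(steps):
        probs = softmax(logits)
        # Step 1: sample an intermediate decision (the focus region).
        region = rng.choices(range(num_regions), weights=probs)[0]

        # Step 2: simulated answerer. Assumption: focusing on the
        # informative region yields a correct answer 90% of the time,
        # any other region only 20% of the time.
        p_correct = 0.9 if region == informative_region else 0.2
        reward = 1.0 if rng.random() < p_correct else 0.0

        # Step 3: policy-gradient update. For a softmax policy,
        # d log pi(region) / d logit_i = 1[i == region] - probs[i].
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        for i in range(num_regions):
            grad = (1.0 if i == region else 0.0) - probs[i]
            logits[i] += lr * advantage * grad

    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_focus_policy()
    print([round(p, 3) for p in final_probs])
```

The policy is never told which region is correct. It only observes whether the final answer was rewarded, which mirrors the trial-and-error framing in the abstract at a very small scale.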

