Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to the textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. The second phase applies reinforcement learning (RL) with a curiosity-driven reward scheme that balances exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, \model, achieves 84\% on V* bench, 74\% on TallyQA-Complex, and 84\% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.