Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to the textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. The second phase applies reinforcement learning (RL) with a curiosity-driven reward scheme that balances exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, \model, achieves 84\% on V* bench, 74\% on TallyQA-Complex, and 84\% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.