Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, producing long, disorganized contexts in which small perceptual mistakes can cascade into completely wrong answers. Moreover, achieving good performance typically requires expensive reinforcement learning with hand-crafted rewards. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline: the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it bottlenecks end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task while requiring a 200$\times$ smaller token budget.
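The two-stage pipeline described above can be sketched in miniature. This is a toy illustration only, not the paper's implementation: every name here (`visual_search`, `reason`, `sparc_answer`, the `Region` type) is hypothetical, the image is a small integer grid, and a simple brightness threshold on a block-max-pooled view stands in for the VLM's low-resolution global search, while a pixel sum over high-resolution crops stands in for conditioning the reasoning stage on the selected regions.

```python
# Toy sketch of a SPARC-style perceive-then-reason pipeline.
# All function and type names are illustrative placeholders, not the
# paper's API; a real system would issue VLM calls in both stages.
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Region:
    """A question-relevant image region located by the perception stage."""
    x0: int
    y0: int
    x1: int
    y1: int

def downscale(image: List[List[int]], factor: int) -> List[List[int]]:
    """Block-max pooling, standing in for encoding the image at lower
    resolution (and thus with fewer visual tokens)."""
    h, w = len(image), len(image[0])
    return [[max(image[i + di][j + dj]
                 for di in range(factor) for dj in range(factor))
             for j in range(0, w, factor)]
            for i in range(0, h, factor)]

def visual_search(image: List[List[int]], question: str,
                  factor: int = 2) -> List[Region]:
    """Stage 1 (perception): scan a low-resolution view for salient cells
    and map their coordinates back to full resolution. The brightness
    test is a toy stand-in for a VLM-driven search step."""
    coarse = downscale(image, factor)
    regions = []
    for i, row in enumerate(coarse):
        for j, v in enumerate(row):
            if v > 0:
                regions.append(Region(j * factor, i * factor,
                                      (j + 1) * factor, (i + 1) * factor))
    return regions

def crop(image: List[List[int]], r: Region) -> List[List[int]]:
    """Extract a full-resolution crop for one selected region."""
    return [row[r.x0:r.x1] for row in image[r.y0:r.y1]]

def reason(crops: List[List[List[int]]], question: str) -> str:
    """Stage 2 (reasoning): condition only on the high-resolution crops.
    A pixel sum replaces what would be a VLM answering the question."""
    bright = sum(v for c in crops for row in c for v in row)
    return f"bright_mass={bright}"

def sparc_answer(image: List[List[int]], question: str,
                 factor: int = 2) -> str:
    """Perceive at low resolution, then reason over selected crops."""
    regions = visual_search(image, question, factor)
    crops = [crop(image, r) for r in regions]
    return reason(crops, question)
```

The point of the separation shows up structurally even in this toy: only `downscale`-sized input reaches the global search, high-resolution pixels are processed solely inside the selected regions, and each stage can be scaled or improved independently of the other.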