While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.