VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG systems face two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality. (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. Together, these mechanisms keep the evidence space anchored in context while discarding earlier raw interactions, preventing the context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
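
As a rough illustration of the sliding-window idea described above, the sketch below rebuilds the agent's context at each step from three parts: the original query, the accumulated evidence notes, and only the most recent raw turns. It is not the authors' implementation. The names `Turn`, `AgentState`, and `build_context`, the bracketed tags, and the default window size are all hypothetical, and the sketch assumes that intent injection amounts to restating the original query at every step and that observations can be represented as strings in place of page images.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One raw interaction: an agent action and the observation it produced."""
    action: str
    observation: str  # placeholder for retrieved page images / visual tokens


@dataclass
class AgentState:
    query: str                                          # original search intent
    evidence: List[str] = field(default_factory=list)   # distilled evidence notes
    turns: List[Turn] = field(default_factory=list)     # full raw trajectory


def build_context(state: AgentState, window: int = 2) -> List[str]:
    """Rebuild the context for the next step (hypothetical layout).

    Always keeps the query and every evidence note, but includes only the
    last `window` raw turns, so older observations drop out of the context.
    """
    context = [f"[INTENT] {state.query}"]
    context += [f"[EVIDENCE] {note}" for note in state.evidence]
    # state.turns[-0:] would return the whole list, so handle window == 0.
    recent = state.turns[-window:] if window > 0 else []
    for turn in recent:
        context.append(f"[ACTION] {turn.action}")
        context.append(f"[OBSERVATION] {turn.observation}")
    return context
```

In this layout the context size is bounded by the number of evidence notes plus a fixed number of recent turns, instead of growing with every retrieved page.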
