Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses a page by its index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by the numerous visual tokens of long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
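To make the visual-semantic anchoring idea concrete, the following is a minimal sketch of a dual-path KL-divergence penalty that regularizes the policy toward a frozen reference model separately over visual and textual token positions. All function names, coefficients (`beta_v`, `beta_t`), and array shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_kl(p, q, eps=1e-9):
    # KL(p || q) per position, averaged over positions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def dual_path_kl(policy_logits, ref_logits, visual_mask, beta_v=0.1, beta_t=0.05):
    """Hypothetical dual-path KL penalty.

    policy_logits, ref_logits: (seq_len, vocab) arrays of next-token logits.
    visual_mask: (seq_len,) boolean array, True at visual-token positions.
    Visual and textual positions get separate KL terms (and coefficients),
    so drift in the many visual tokens is constrained independently of text.
    """
    p, q = softmax(policy_logits), softmax(ref_logits)
    vis, txt = visual_mask, ~visual_mask
    kl_vis = mean_kl(p[vis], q[vis]) if vis.any() else 0.0
    kl_txt = mean_kl(p[txt], q[txt]) if txt.any() else 0.0
    return beta_v * kl_vis + beta_t * kl_txt
```

In practice this term would be added to the RL objective; weighting the visual path separately lets the constraint be tightened on the token type that empirically destabilizes training.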