The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant gap remains in processing the web's most dynamic and information-dense modality: video. In this paper, we first formalize the task of Agentic Video Browsing and introduce Video-BrowseComp, a benchmark that evaluates open-ended agentic browsing tasks whose answers mandatorily depend on video content. We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses the critical visual details required for accurate grounding. To address this, we propose Video-Browser, a novel agent built on Pyramidal Perception: it filters candidates with cheap metadata and zooms in with expensive visual perception only when necessary. Experiments demonstrate that our approach achieves a 37.5% relative improvement while reducing token consumption by 58.3% compared to direct visual inference, establishing a foundation for verifiable open-web video research. We open-source all code and the benchmark at https://anonymous.4open.science/r/VideoBrowser and https://github.com/chrisx599/Video-Browser.
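The coarse-to-fine idea behind Pyramidal Perception can be illustrated with a minimal sketch: rank candidate videos by a cheap metadata signal first, and spend the expensive visual-perception step only on the few candidates that survive the filter. All class, function, and field names below are hypothetical illustrations, not the paper's actual API; the "visual" tier is mocked with frame captions.

```python
from dataclasses import dataclass, field

@dataclass
class Video:
    title: str
    description: str
    frames: list = field(default_factory=list)  # stand-in for raw visual content

def metadata_score(video: Video, query_terms: set) -> int:
    """Cheap tier: count query terms appearing in the title/description."""
    text = (video.title + " " + video.description).lower()
    return sum(term in text for term in query_terms)

def visual_verify(video: Video, query_terms: set) -> bool:
    """Expensive tier (mocked): pretend each 'frame' is a VLM/OCR caption."""
    return any(any(t in frame.lower() for t in query_terms) for frame in video.frames)

def browse(videos: list, query: str, budget: int = 2) -> list:
    terms = set(query.lower().split())
    # Filter with cheap metadata, keeping at most `budget` candidates...
    ranked = sorted(videos, key=lambda v: metadata_score(v, terms), reverse=True)
    shortlist = [v for v in ranked[:budget] if metadata_score(v, terms) > 0]
    # ...then zoom in with expensive visual perception on the shortlist only.
    return [v for v in shortlist if visual_verify(v, terms)]
```

The key design point is that `visual_verify` (the costly call in a real system) is never invoked on videos the metadata filter already rules out, which is the source of the token savings the abstract reports.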