Recent work has demonstrated the strong capabilities of Large Language Model (LLM) agents on web-based information-seeking tasks. However, existing efforts focus mainly on single-fact retrieval and rely on outcome-only verification, which limits their scalability to realistic knowledge-intensive scenarios. Such scenarios involve long-horizon web tasks that require large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a verifiable long-chain web benchmark designed to support the evaluation and development of web agents in realistic web environments. The benchmark emphasizes two dimensions: (1) long-chain complexity, covering both breadth- and depth-oriented search tasks to assess how well web agents maintain comprehensive information coverage and consistent context tracking in multi-hop reasoning; and (2) subtask-level verifiability, where each task is decomposed into a sequence of interdependent, verifiable subtasks. This structure allows diverse exploration strategies within each subtask while keeping each subtask-level answer fixed and verifiable. The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration annotated by human experts. Extensive experiments on VeriWeb with agents powered by different foundation models reveal significant performance gaps on long-horizon web tasks, highlighting the need for stronger agentic information-seeking capabilities.
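To make the contrast between outcome-only verification and subtask-level verifiability concrete, the following is a minimal sketch, not the benchmark's actual data format or scoring code. The `Subtask` schema, the exact-match comparison after normalization, and the function names are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    """One verifiable step in a long-chain task (hypothetical schema)."""
    question: str
    expected_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(text.lower().split())


def subtask_accuracy(subtasks, predictions):
    """Fraction of subtasks whose predicted answer matches the reference.

    Subtasks without a prediction count as incorrect.
    """
    if not subtasks:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(sub.expected_answer)
        for sub, pred in zip(subtasks, predictions)
    )
    return correct / len(subtasks)


def outcome_only(subtasks, predictions):
    """Outcome-only check: inspects the final subtask's answer alone."""
    if not subtasks or len(predictions) < len(subtasks):
        return False
    final_prediction = predictions[len(subtasks) - 1]
    return normalize(final_prediction) == normalize(subtasks[-1].expected_answer)
```

Under these assumptions, an agent that gets early subtasks right but fails the last one still receives partial credit from `subtask_accuracy`, whereas `outcome_only` reports only a failure. This is the kind of intermediate signal that subtask-level verification is meant to expose.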