Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types and represent the "what" of learning. Conversely, cognitive processes correspond to Exploring, which is grounded in Procedural knowledge and defines the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites and designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, with which we develop and train our proposed agent, the Web-CogReasoner. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Extensive experiments show that Web-CogReasoner significantly outperforms existing models, especially in generalizing to unseen tasks, where structured knowledge is decisive. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner