With the advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVMs), agents have shown significant capabilities in tasks such as data analysis, gaming, and code generation. Recently, there has been a surge in research on web agents, which perform tasks within the web environment. However, the web presents unforeseeable scenarios that challenge the generalizability of these agents. This study investigates the disparities between human and web agent performance on web tasks (e.g., information search), concentrating on the planning, action, and reflection aspects of task execution. We conducted a web task study with a think-aloud protocol, which revealed the distinct cognitive actions and website operations that humans employ. Comparing existing agent structures with human behavior and thought processes highlighted differences in knowledge updating and ambiguity handling during task performance. Humans demonstrated a propensity for exploring and modifying plans based on additional information and for investigating reasons for failure. These findings offer insights into designing planning, reflection, and information discovery modules for web agents, and into designing methods for capturing implicit human knowledge in web tasks.