LLM-driven GUI agents are increasingly used in production systems to automate workflows and simulate users for evaluation and optimization. Yet most GUI-agent evaluations emphasize task success and provide limited evidence on whether agents interact in human-like ways. We present a trace-level evaluation framework that compares human and agent behavior across (i) task outcome and effort, (ii) query formulation, and (iii) navigation across interface states. We instantiate the framework in a controlled study in a production audio-streaming search application, where 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks. The agent achieves task success comparable to participants and generates broadly aligned queries, but follows systematically different navigation strategies: participants exhibit content-centric, exploratory behavior, while the agent is more search-centric and low-branching. These results show that outcome and query alignment do not imply behavioral alignment, motivating trace-level diagnostics when deploying GUI agents as proxies for users in production search systems.