Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, and often either proceed autonomously past critical decision points or request unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents: hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show that structured modeling of human intervention leads to more adaptive, collaborative agents.