Web agents based on large language models have shown promise in automating web tasks. However, current web agents struggle to reason about sensible actions because they are limited in predicting how the environment will change, and they may lack comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi-agent collaboration process in which an action model consults a world model, acting as a web-environment expert, for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state-transition dynamics to improve candidate action proposals. To achieve risk-aware, resilient task execution, we introduce a two-stage deduction chain: a world model, specialized in environmental state transitions, simulates the outcomes of candidate actions, and a judge model then scrutinizes these simulated outcomes and triggers corrective feedback on the action when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online-Mind2Web.
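The control flow described in the abstract (consult, propose, simulate, judge, refine) can be illustrated with a minimal sketch. This is not the WAC implementation: the function names, the callable signatures, the `Verdict` type, the `max_refinements` budget, and the choice to return `None` when no proposal passes are all assumptions made for illustration, and the toy models are hard-coded stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical judge output: accept the action, or explain the risk."""
    acceptable: bool
    feedback: str = ""


# Hypothetical interfaces; in a real agent each would wrap a model call.
Advise = Callable[[str, str], str]             # (task, observation) -> guidance
Propose = Callable[[str, str, str, str], str]  # (task, observation, guidance, feedback) -> action
Simulate = Callable[[str, str], str]           # (observation, action) -> predicted outcome
Judge = Callable[[str, str, str], Verdict]     # (task, action, predicted outcome) -> verdict


def choose_action(task: str, observation: str, advise: Advise, propose: Propose,
                  simulate: Simulate, judge: Judge,
                  max_refinements: int = 3) -> Optional[str]:
    """Return an action the judge accepts, or None if the budget runs out."""
    # Collaboration: the action model consults the world model once per step.
    guidance = advise(task, observation)
    feedback = ""
    for _ in range(max_refinements + 1):
        # The action model grounds the guidance (and any feedback) into an action.
        action = propose(task, observation, guidance, feedback)
        # Stage 1: the world model simulates the consequence of the action.
        predicted = simulate(observation, action)
        # Stage 2: the judge scrutinizes the simulated outcome.
        verdict = judge(task, action, predicted)
        if verdict.acceptable:
            return action
        # Feedback-driven refinement: retry with the judge's objection.
        feedback = verdict.feedback
    return None  # Assumed fallback: no proposal passed the judge.


# Toy stand-ins, hard-coded so the sketch runs without any model.
def toy_advise(task: str, observation: str) -> str:
    return "review the cart before paying"


def toy_propose(task: str, observation: str, guidance: str, feedback: str) -> str:
    return "click('Review cart')" if feedback else "click('Place order')"


def toy_simulate(observation: str, action: str) -> str:
    if "Place order" in action:
        return "order submitted, payment charged"
    return "cart page shown"


def toy_judge(task: str, action: str, predicted: str) -> Verdict:
    if "charged" in predicted:
        return Verdict(False, "irreversible payment before the cart was verified")
    return Verdict(True)


if __name__ == "__main__":
    print(choose_action("buy the cheapest item", "checkout page",
                        toy_advise, toy_propose, toy_simulate, toy_judge))
```

In this toy run the first proposal is rejected because its simulated outcome is an irreversible payment, and the second proposal, made with the judge's feedback, is accepted. The risky action is therefore never executed against the real page; only its simulated consequence is inspected.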