In web agent research, achieving both generalization and accuracy remains a challenging problem. Because website structure varies widely, existing approaches often fail, and existing fine-tuning and in-context learning techniques do not generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum, which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall and by up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite receiving only textual inputs, and further analysis reveals that a substantial number of failures are due to the engineering challenges of operating the web.
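The idea of populating a prompt with ranked demonstrations from previous runs can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Demonstration` record, the `token_overlap_score` function, and the prompt layout are hypothetical, and the word-overlap scorer is only a stand-in for the learned differentiable ranking model, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Demonstration:
    """Hypothetical record of one step from a previous agent run."""
    goal: str
    action: str
    succeeded: bool


def token_overlap_score(goal: str, demo: Demonstration) -> float:
    """Placeholder scorer: Jaccard overlap of goal words.

    This is NOT the paper's ranking model; it only stands in for a
    function that scores how useful a demonstration is for a goal.
    """
    a = set(goal.lower().split())
    b = set(demo.goal.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_demonstrations(
    goal: str,
    pool: Sequence[Demonstration],
    scorer: Callable[[str, Demonstration], float] = token_overlap_score,
    k: int = 2,
) -> List[Demonstration]:
    """Return the k highest-scoring demonstrations for the goal."""
    ranked = sorted(pool, key=lambda d: scorer(goal, d), reverse=True)
    return ranked[:k]


def build_prompt(goal: str, demos: Sequence[Demonstration]) -> str:
    """Assemble a text prompt (assumed layout) for a black-box LLM."""
    lines = [f"Goal: {goal}", "Demonstrations from previous runs:"]
    for d in demos:
        outcome = "success" if d.succeeded else "failure"
        lines.append(f"- [{outcome}] goal={d.goal!r} action={d.action!r}")
    lines.append("Next action:")
    return "\n".join(lines)


pool = [
    Demonstration("book a flight to Paris", "click('Flights')", True),
    Demonstration("search for a laptop", "type('search', 'laptop')", True),
    Demonstration("search for a phone", "type('search', 'phone')", False),
]
goal = "search for a gaming laptop"
chosen = select_demonstrations(goal, pool, k=2)
prompt = build_prompt(goal, chosen)
```

Keeping failed demonstrations in the pool is a design choice of this sketch: it lets the prompt show the model what did not work as well as what did, which is in the spirit of learning from mistakes described above, though the actual mechanism may differ.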