WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.
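The idea of populating a prompt with demonstrations from previous runs can be illustrated with a minimal sketch. This is not Wilbur's implementation: the abstract describes a trained, differentiable ranking model, whereas the sketch below substitutes a plain bag-of-words cosine similarity as a stand-in scorer. The `Demo` record, the `select_demos` and `build_prompt` helpers, and the prompt layout are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Demo:
    """One record from a previous agent run (hypothetical schema)."""
    goal: str      # natural-language goal of the earlier run
    action: str    # action the agent took
    success: bool  # whether the run was judged successful


def embed(text: str) -> Counter:
    # Stand-in for a learned embedding: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_demos(goal: str, bank: list[Demo], k: int = 2) -> list[Demo]:
    # Rank stored demonstrations by similarity to the current goal.
    # Assumption: a trained ranking model would replace `cosine` here.
    query = embed(goal)
    ranked = sorted(bank, key=lambda d: cosine(query, embed(d.goal)), reverse=True)
    return ranked[:k]


def build_prompt(goal: str, demos: list[Demo]) -> str:
    # Assemble a text prompt for a black-box LLM from the selected demos.
    lines = ["Previous runs:"]
    for d in demos:
        outcome = "succeeded" if d.success else "failed"
        lines.append(f"- goal: {d.goal} | action: {d.action} | {outcome}")
    lines.append(f"Current goal: {goal}")
    return "\n".join(lines)


bank = [
    Demo("search for a phone on amazon", "type 'phone' into the search box", True),
    Demo("book a flight to paris", "click the 'Flights' tab", False),
    Demo("find a recipe for pasta", "type 'pasta' into the search box", True),
]
goal = "search for a laptop on amazon"
top = select_demos(goal, bank, k=2)
prompt = build_prompt(goal, top)
print(prompt)
```

Both successful and failed runs are kept in the demonstration bank here, on the assumption that a failed example can also inform the model. Whether and how Wilbur weights failures is not stated in the abstract.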

