The vision of a broadly capable, goal-directed agent, such as an Internet-browsing agent in the digital world or a household humanoid in the physical world, has advanced rapidly thanks to the generalization capability of foundation models. Such a generalist agent needs a large and diverse skill repertoire, for example finding directions between two travel locations or buying specific items on the Internet. If each skill must be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited by the quantity and diversity of those instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice, using context information about the environment such as user demos or, for Internet-browsing agents, even just the name of the website itself. The agent policy then attempts those tasks with thoughts and actual grounded operations in the real world, and the resulting trajectories are evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with state-of-the-art performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/
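The propose, attempt, and evaluate loop described above can be illustrated with a minimal Python sketch. It is not the released PAE code. Every name in it (`propose_tasks`, `run_agent`, `evaluate`, `collect_rollouts`, `Rollout`) is hypothetical, the proposer, policy, and evaluator are replaced by trivial stubs rather than foundation models, and the abstract does not specify the RL algorithm, so the sketch stops at collecting reward-labeled trajectories.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    task: str
    trajectory: List[str]
    reward: float  # binary success signal from the evaluator


def propose_tasks(context: str, n: int) -> List[str]:
    """Hypothetical stand-in for the context-aware task proposer.

    A real proposer would query a foundation model with the context
    (e.g. a website name or user demos); here we just template strings.
    """
    return [f"Practice task {i} on {context}" for i in range(n)]


def run_agent(task: str, success_prob: float, rng: random.Random) -> List[str]:
    """Stub policy: emits a thought, an action, and a final state."""
    steps = [f"thought: plan for '{task}'", "action: click"]
    reached = rng.random() < success_prob
    steps.append("state: goal reached" if reached else "state: goal not reached")
    return steps


def evaluate(task: str, trajectory: List[str]) -> float:
    """Stub for the VLM-based success evaluator: returns 1.0 or 0.0."""
    return 1.0 if trajectory[-1] == "state: goal reached" else 0.0


def collect_rollouts(context: str, n_tasks: int, success_prob: float,
                     seed: int = 0) -> List[Rollout]:
    """One round of propose -> attempt -> evaluate."""
    rng = random.Random(seed)
    rollouts = []
    for task in propose_tasks(context, n_tasks):
        trajectory = run_agent(task, success_prob, rng)
        rollouts.append(Rollout(task, trajectory, evaluate(task, trajectory)))
    return rollouts


if __name__ == "__main__":
    batch = collect_rollouts("example.com", n_tasks=4, success_prob=0.5)
    for r in batch:
        print(r.reward, r.task)
```

In a real system, the reward-labeled rollouts would feed whatever policy-optimization step is used. The important structural point is that no human-annotated task list appears anywhere in the loop: only the context string is supplied from outside.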