Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents rely heavily on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents: the scarcity of training tasks, sparse feedback signals, and policy distribution drift during online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies that ensure consistent improvement. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4% and that of GLM-4-9B from 6.1% to 43%. These open models significantly surpass GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform the previous state-of-the-art web agent trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
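To make the interplay of the three components concrete, the following is a minimal, self-contained Python sketch of the training cycle the abstract describes: rollouts are scored by an outcome-supervised reward, failures seed new curriculum tasks, and the policy takes a constrained update each phase. Every name here (`policy_temp`, `rollout`, `orm`, `evolve_tasks`) is an illustrative stand-in under stated assumptions, not the authors' implementation, which trains an actual LLM policy against real web environments.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str
    actions: list
    success: bool


def rollout(policy_temp: float, task: str) -> Trajectory:
    """Toy stand-in for running the agent in a web environment.

    Success probability is a placeholder; a real rollout would drive a
    browser with an LLM policy choosing actions step by step.
    """
    success = random.random() < policy_temp
    return Trajectory(task=task, actions=["click", "type"], success=success)


def orm(traj: Trajectory) -> float:
    """Outcome-supervised reward: scores the whole trajectory 0/1.

    Here we simply echo the environment outcome; the paper instead trains
    a reward model to judge success, since live feedback is sparse.
    """
    return 1.0 if traj.success else 0.0


def evolve_tasks(failures: list[str]) -> list[str]:
    """Self-evolving curriculum: derive new tasks from unsuccessful attempts.

    A real system would prompt an LLM to mutate or simplify failed tasks;
    this stub just tags them as variants.
    """
    return [f"variant of: {t}" for t in failures]


def train(num_phases: int = 3) -> None:
    tasks = ["book a flight", "find the cheapest laptop"]
    policy_temp = 0.2  # toy scalar proxy for policy quality
    for phase in range(num_phases):
        failures = []
        for task in random.sample(tasks, k=min(4, len(tasks))):
            traj = rollout(policy_temp, task)
            if orm(traj) == 0.0:
                failures.append(task)
        # Adaptive RL update: a small, capped step stands in for the
        # KL-style constraint that limits policy distribution drift.
        policy_temp = min(1.0, policy_temp + 0.1)
        tasks += evolve_tasks(failures)  # grow the curriculum from failures
        print(f"phase {phase}: {len(tasks)} tasks, policy={policy_temp:.2f}")


if __name__ == "__main__":
    train()
```

The design point the sketch mirrors is that the curriculum, the reward model, and the update rule form one loop: failures are not discarded but recycled into training signal, which is what lets a small open model improve steadily without proprietary-API supervision.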