Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation is particularly challenging because it requires handling dynamic content and long sequences of actions. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track of the task as new information arrives, lacking a clear and adaptive path toward the final goal. The problem is compounded during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that uses proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism raises the success rate (SR) of proprietary models such as Gemini by approximately 10 percentage points on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This result surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.