From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fall short in two critical dimensions: they lack the discrimination capability to reject plausible but incorrect elements in densely populated pages, and they generalize poorly to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with a 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by more than 16 percentage points, validating that a specialized data curriculum outweighs raw parameter scale for web navigation.
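As a rough illustration of the two preference-optimization stages named above, the following minimal sketch shows the odds-ratio term used in ORPO and the group-relative advantage normalization used in GRPO, as those methods are commonly formulated. It is not the authors' implementation: the function names, the scalar inputs, and the `eps` constant are assumptions made for illustration, and the abstract does not specify the reward design or hyperparameters.

```python
import math
from statistics import mean, pstdev


def orpo_odds_ratio_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio term of ORPO for one (chosen, rejected) pair.

    Inputs are assumed to be length-normalized sequence log-probabilities,
    so both must be strictly negative. The full ORPO objective adds this
    term, scaled by a weight, to the usual SFT loss; only the odds-ratio
    term is shown here.
    """
    def log_odds(logp: float) -> float:
        # log(p / (1 - p)) computed from log p
        return logp - math.log1p(-math.exp(logp))

    margin = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(margin)) is equal to log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: standardize rewards within one group
    of rollouts sampled for the same task (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical pair: the correct element is more likely than the distractor.
    print(orpo_odds_ratio_loss(math.log(0.6), math.log(0.2)))
    # Hypothetical rewards for four rollouts of the same navigation task.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch the odds-ratio loss shrinks as the chosen action becomes more likely relative to the rejected one, which is the behavior a hard-negative distractor pair is meant to train. The GRPO advantages need no learned value function because each rollout is scored only against the others in its group.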

