Text-based web agents offer computational efficiency for autonomous web navigation, yet building robust agents remains challenging because real-world HTML is noisy and heterogeneous. Standard Supervised Fine-Tuning (SFT) falls short in two critical respects: it lacks the discriminative ability to reject plausible but incorrect elements on densely populated pages, and it generalizes poorly to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline, which synthesizes diverse cross-domain tasks with strict verification. Building on this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web shows that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with a 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by more than 16 percentage points, which supports the claim that a specialized data curriculum outweighs raw parameter scale for web navigation.
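The abstract names the two preference-stage objectives without giving their form. As a minimal sketch, assuming the standard published formulations of ORPO (a log-odds-ratio penalty added to the SFT loss) and GRPO (rewards normalized within a group of sampled rollouts), the core quantities can be computed as below. The function names, the sample log-probabilities, and the sample rewards are all hypothetical; how Triton defines its rewards or weights these terms is not stated in the abstract.

```python
import math
from statistics import mean, pstdev


def log_odds(avg_logprob: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logprob); requires avg_logprob < 0."""
    return avg_logprob - math.log1p(-math.exp(avg_logprob))


def orpo_penalty(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio term of ORPO: -log(sigmoid(log-odds(chosen) - log-odds(rejected))).

    Inputs are length-averaged log-probabilities of the chosen (correct element)
    and rejected (hard-negative distractor) responses under the policy.
    """
    log_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-log_ratio))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical values: the policy prefers the correct element in the first
    # call and the distractor in the second, so the second penalty is larger.
    print(orpo_penalty(-0.1, -2.0))
    print(orpo_penalty(-2.0, -0.1))
    # Hypothetical per-rollout rewards for one task (1.0 = step succeeded).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch the ORPO penalty shrinks as the policy assigns higher odds to the correct element than to the distractor, and the GRPO advantages are zero when every rollout in a group earns the same reward, so such groups contribute no policy gradient.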