Large Language Model (LLM) agents are rapidly improving at handling increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models such as GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs on production-scale workflow data collected from over 250 domains, totaling 6 billion tokens. This simple yet effective approach yields substantial gains over prompting-based agents on existing benchmarks: ScribeAgent achieves state-of-the-art direct-generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context-window optimization, and the effect of dataset size.