Solving real-world problems efficiently with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. Recent systems such as Search-R1 and WebDancer perform strongly on web tasks, but they rely heavily on additional tools to convert the interactive web environment into static text content. This contrasts with human browsing behavior, which involves diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training procedure, Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT), to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism that stores key conclusions across steps, further enhancing the model's reasoning capabilities on long-horizon tasks. Notably, BrowserAgent-7B achieves around a 20% improvement over Search-R1 on multi-hop QA tasks such as HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.
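To make the interaction loop concrete, the following is a minimal sketch of an agent that issues predefined browser actions and keeps an explicit memory of key conclusions across steps. It is an illustration under stated assumptions, not BrowserAgent's implementation: the names `Action`, `Memory`, `run_episode`, `toy_policy`, and `toy_env_step`, the action kinds, and the element identifier `link:capital` are all hypothetical, and the browser is replaced by a toy function so the example runs without Playwright. In a real setup, the environment step would presumably dispatch to Playwright calls such as `page.click`, `page.fill`, or `page.mouse.wheel`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    """A hypothetical browser action; the kinds are illustrative only."""
    kind: str          # "click", "type", "scroll", or "stop"
    target: str = ""   # element identifier (unused for "scroll" and "stop")
    text: str = ""     # text to type, or the final answer for "stop"


@dataclass
class Memory:
    """Explicit memory: key conclusions carried across steps."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        # Skip empty conclusions and exact repeats.
        if note and note not in self.notes:
            self.notes.append(note)

    def render(self) -> str:
        return "\n".join(f"- {n}" for n in self.notes)


# A policy maps (question, observation, memory text) to (action, conclusion).
Policy = Callable[[str, str, str], Tuple[Action, str]]
# An environment step applies an action and returns the next observation.
EnvStep = Callable[[Action], str]


def run_episode(question: str, policy: Policy, env_step: EnvStep,
                first_obs: str, max_steps: int = 10) -> Tuple[str, Memory]:
    """Run one episode; return the final answer ("" if none) and the memory."""
    memory = Memory()
    obs = first_obs
    for _ in range(max_steps):
        action, conclusion = policy(question, obs, memory.render())
        memory.add(conclusion)
        if action.kind == "stop":
            return action.text, memory
        obs = env_step(action)
    return "", memory


def toy_env_step(action: Action) -> str:
    """Stand-in for a browser: clicking one specific link reveals the answer page."""
    if action.kind == "click" and action.target == "link:capital":
        return "Page: The capital of France is Paris."
    return "Page: home"


def toy_policy(question: str, obs: str, memory_text: str) -> Tuple[Action, str]:
    """Scripted stand-in for the model: click once, then answer."""
    if "Paris" in obs:
        return Action("stop", text="Paris"), "capital of France = Paris"
    return Action("click", target="link:capital"), ""


if __name__ == "__main__":
    answer, memory = run_episode(
        "What is the capital of France?", toy_policy, toy_env_step, "Page: home"
    )
    print(answer)
    print(memory.render())
```

Passing the rendered memory back to the policy at every step is one simple way to let conclusions from earlier pages inform later actions without replaying the full page history; the actual memory format used by BrowserAgent is not specified here.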