Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb), we develop a specialized agent that performs the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as the underlying LLMs. Our evaluation shows that the RAG, MCP, and NLWeb agents outperform the HTML agent on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage per task falls from about 241k for HTML to between 47k and 140k. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5, which achieves an F1 score of 0.87 and a completion rate of 0.79. When cost is also taken into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show that the choice of interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.