Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research on AI agents that interact with the internet through web browsing. However, there is another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask: what if we were to take tasks traditionally tackled by browsing agents and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely used and realistic benchmark for web navigation tasks, we find that API-based agents outperform web browsing agents. Hybrid Agents outperform both nearly uniformly across tasks, yielding an absolute improvement of more than 20.0% over web browsing alone and a success rate of 35.8%, the state-of-the-art (SOTA) performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.