Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.