We interact with computers every day, in both daily life and work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has been rapid development in AI agents that interact with and effect change in their surrounding environments. But how well do AI agents perform at accelerating or even autonomously carrying out work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows and for economic policy, which must understand the effects that adoption of AI may have on the labor market. To measure the progress of LLM agents on real-world professional tasks, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 24% of the tasks autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks can be solved autonomously, but more difficult long-horizon tasks remain beyond the reach of current systems.