We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers who use enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, a considerable gap remains before full task automation is achieved. Notably, our analysis uncovers a significant performance disparity between open- and closed-source LLMs, highlighting a critical area for future exploration and development in the field.