OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs best on our benchmark. However, it still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, which demonstrates how challenging our task is for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks, and it motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
