OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on a computer requires human input at every step. Autonomous virtual agents represent an exciting step toward automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also streamline numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs that accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs best. However, it still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating how challenging our task is for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks, and it motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
