OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

From arXiv; 51 pages, 21 figures

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains. They therefore fail to reflect the diverse and complex nature of real-world computer use, which limits both the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
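To make the idea of a task that pairs an initial state setup with an execution-based evaluation script concrete, the following is a minimal sketch in Python. It is not OSWorld's actual task format or API: the `Task` class, the `run_task` and `success_rate` helpers, the file names, and the two toy agents are all hypothetical, and a temporary directory stands in for a real computer environment. The point it illustrates is that success is judged from the final state of the environment, not from the actions the agent reports.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List

# Hypothetical types for illustration only; not OSWorld's real interfaces.
Agent = Callable[[str, Path], None]


@dataclass
class Task:
    instruction: str                   # natural-language goal given to the agent
    setup: Callable[[Path], None]      # prepares the initial state
    evaluate: Callable[[Path], bool]   # inspects the final state


def setup_notes(workdir: Path) -> None:
    """Initial state: a single file named notes.txt."""
    (workdir / "notes.txt").write_text("draft\n")


def check_renamed(workdir: Path) -> bool:
    """Execution-based check: the old file is gone, the new one has the same content."""
    old, new = workdir / "notes.txt", workdir / "final.txt"
    return (not old.exists()) and new.exists() and new.read_text() == "draft\n"


def run_task(task: Task, agent: Agent) -> bool:
    """Set up a fresh environment, let the agent act, then evaluate the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.evaluate(workdir)


def success_rate(tasks: List[Task], agent: Agent) -> float:
    """Fraction of tasks whose final state passes its evaluation function."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


# Two toy agents: one performs the rename, one does nothing.
def renaming_agent(instruction: str, workdir: Path) -> None:
    (workdir / "notes.txt").rename(workdir / "final.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    pass


TASK = Task("Rename notes.txt to final.txt", setup_notes, check_renamed)

if __name__ == "__main__":
    print(success_rate([TASK], renaming_agent))
    print(success_rate([TASK], idle_agent))
```

Because each run starts from a freshly created state and is scored by inspecting what the agent left behind, repeated runs of the same task are comparable, which is the property the abstract refers to as reliable, reproducible evaluation.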
