Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.