CoAct-1: Computer-using Multi-Agent System with Coding Actions

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, the agents remain constrained by having to perform every action through GUI manipulation, which leads to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the system to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach improves efficiency, reducing the average number of steps required to complete a task to 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
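The delegation pattern described above can be illustrated with a minimal sketch. This is not the CoAct-1 implementation: the names `Subtask`, `Route`, `route_subtask`, `run_python_script`, and the `scriptable` flag are hypothetical, and the fixed routing rule stands in for an Orchestrator that, in the system described, decides dynamically.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """The two kinds of worker an orchestrator can delegate to."""
    PROGRAMMER = "programmer"
    GUI_OPERATOR = "gui_operator"


@dataclass
class Subtask:
    description: str
    # Hypothetical flag: True when the subtask can be completed by a script
    # alone, without any visual interaction.
    scriptable: bool


def route_subtask(subtask: Subtask) -> Route:
    """Toy routing rule (an assumption, not the paper's policy)."""
    return Route.PROGRAMMER if subtask.scriptable else Route.GUI_OPERATOR


def run_python_script(source: str) -> str:
    """Programmer-style action: run a Python script and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout.strip()


def handle(subtask: Subtask, script: str = "") -> str:
    """Dispatch one subtask; the GUI branch is only a placeholder string."""
    if route_subtask(subtask) is Route.PROGRAMMER:
        return run_python_script(script)
    return f"GUI operator would handle: {subtask.description}"
```

In this sketch, a file-management subtask marked `scriptable=True` is completed by one script execution, whereas the same work through a GUI would take a sequence of clicks and keystrokes. This is the kind of step saving the abstract attributes to coding actions.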
