Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and continual re-planning based on new information. In this paper, we argue that we should instead move towards building and evaluating multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes a computer use task into a directed acyclic graph (DAG) that encodes the relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out the nodes on the ready frontier of the DAG, and it continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first-class challenge: information that downstream agents may not be able to re-observe is retained and passed forward through the manager and the DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by $3.4$--$25.5\%$ on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU speeds up average wall-clock task completion by ${\sim} 1.5 \times$, demonstrating its efficacy in accelerating traditionally slow CUA pipelines. Our findings highlight multi-agent coordination as a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.
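To make the manager loop concrete, the following is a minimal sketch of ready-frontier dispatch over a task DAG. It is an illustration under stated assumptions, not the released implementation: the names `Node`, `ready_frontier`, `run_manager`, and the `revise` hook are hypothetical, the subagent is a stub function rather than a real CUA, and the three-node plan is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One subtask in the plan (hypothetical schema, assumed for illustration)."""
    goal: str
    deps: set = field(default_factory=set)  # ids of nodes that must finish first
    status: str = "pending"                 # pending | done | canceled
    finding: Optional[str] = None           # what the subagent reported back


def ready_frontier(dag):
    """Return ids of pending nodes whose dependencies are all done."""
    return sorted(
        nid for nid, node in dag.items()
        if node.status == "pending"
        and all(dag[d].status == "done" for d in node.deps)
    )


def run_manager(dag, subagent, revise=None, max_workers=4):
    """Dispatch the ready frontier in parallel until no node is ready.

    subagent(goal, context) -> finding, where context maps each dependency id
    to its stored finding, so downstream agents need not re-observe it.
    revise(dag), if given, may add, cancel, or rewrite nodes after each round.
    Returns the frontiers that were dispatched, round by round.
    """
    rounds = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier := ready_frontier(dag):
            rounds.append(frontier)
            futures = {}
            for nid in frontier:
                node = dag[nid]
                context = {d: dag[d].finding for d in node.deps}
                futures[nid] = pool.submit(subagent, node.goal, context)
            # Wait for the whole round, then record findings on the DAG.
            for nid, future in futures.items():
                dag[nid].finding = future.result()
                dag[nid].status = "done"
            if revise is not None:
                revise(dag)
    return rounds


def fake_subagent(goal, context):
    """Stand-in for a CUA subagent; a real one would drive a browser or desktop."""
    return f"{goal} (saw {len(context)} upstream findings)"


# Hypothetical plan: two independent lookups feed one comparison step.
plan = {
    "flights": Node("find a flight"),
    "hotels": Node("find a hotel"),
    "summary": Node("compare total cost", deps={"flights", "hotels"}),
}
rounds = run_manager(plan, fake_subagent)
```

In this sketch the two independent nodes run in the same round and the dependent node runs only after both findings are stored on the DAG. A node whose dependency is canceled by `revise` never becomes ready, so the loop terminates without running it; a fuller system would presumably also handle subagent failures and mid-round revisions.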