Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij

From arXiv: 21 pages, 9 figures, 2 tables

Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
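
The abstract names three ingredients: decoupled planning and execution, self-verification, and adaptive recovery. As a rough illustration only, the loop below sketches how such pieces could fit together. It is not the Surfer 2 implementation: `run_task`, `RunLog`, the `plan`/`execute`/`verify` callables, and the `max_recoveries` parameter are all hypothetical names, and the observation is a plain string standing in for a screenshot.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunLog:
    """Compact record of progress (a stand-in for managed context)."""
    completed: List[str] = field(default_factory=list)
    failures: int = 0
    success: bool = False


def run_task(
    goal: str,
    plan: Callable[[str, List[str]], List[str]],   # hypothetical planner
    execute: Callable[[str], str],                 # hypothetical executor, returns an observation
    verify: Callable[[str, str], bool],            # hypothetical verifier
    max_recoveries: int = 2,
) -> RunLog:
    """Plan subgoals, execute each one, verify it, and replan on failure."""
    log = RunLog()
    subgoals = plan(goal, log.completed)
    while subgoals:
        subgoal = subgoals.pop(0)
        observation = execute(subgoal)      # executor acts; planner is not involved here
        if verify(subgoal, observation):    # self-verification of the step
            log.completed.append(subgoal)
            continue
        log.failures += 1
        if log.failures > max_recoveries:
            return log                      # give up; success stays False
        # Recovery: ask the planner for a fresh plan given verified progress.
        subgoals = plan(goal, log.completed)
    log.success = True
    return log
```

In this sketch the planner only sees the goal and the list of verified subgoals, never the raw observations, which is one simple way to keep planning and execution separate. A real system would pass screenshots to vision language models in each role.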
