Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while also showing that next-generation vision-language models are needed to achieve Pareto-optimal cost-efficiency.