While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization to novel domains. These limitations stem from a lack of fine-grained control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework in which an Orchestrator coordinates two key innovations for robust automation: (1) a Reflection-Memory Agent that uses milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; and (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts the SeeAct paradigm to navigate a browser-based sandbox and synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
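The milestone-driven long-term memory described above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's actual system: the `Milestone` and `MilestoneMemory` names, fields, and the rollback policy are all assumptions chosen to convey the idea of compressing per-step visual context into milestone summaries and using them for trajectory-level self-correction.

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    step: int
    summary: str      # compact text summary replacing raw screenshots
    success: bool     # whether this sub-goal was judged achieved

@dataclass
class MilestoneMemory:
    """Toy sketch: store milestone summaries instead of full visual
    history, and let a reflection step roll the trajectory back to
    the most recent successful milestone."""
    milestones: list = field(default_factory=list)

    def record(self, step: int, summary: str, success: bool) -> None:
        self.milestones.append(Milestone(step, summary, success))

    def last_good_step(self) -> int:
        """Trajectory-level self-correction: return the step of the
        most recent successful milestone to resume from (0 if none)."""
        for m in reversed(self.milestones):
            if m.success:
                return m.step
        return 0

mem = MilestoneMemory()
mem.record(1, "opened Settings window", success=True)
mem.record(2, "clicked wrong tab", success=False)
print(mem.last_good_step())  # resume from step 1
```

In a real agent, the reflection decision would come from a VLM judging the trajectory, and resuming would re-issue actions from the chosen milestone rather than simply returning an index.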