We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), with the goal of transforming human-computer interaction by automating complex, multi-step tasks. Agent S addresses three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37 percentage points in success rate (an 83.6% relative improvement) and achieves a new state of the art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on the newly released WindowsAgentArena benchmark. Code is available at https://github.com/simular-ai/Agent-S.
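The two OSWorld figures quoted above (an absolute gain of 9.37 and a relative gain of 83.6%) together determine the approximate success rates involved. The short sketch below back-computes them, assuming the relative improvement is defined as the absolute gain divided by the baseline success rate; the function name `implied_rates` is hypothetical, and the resulting values are derived from the two quoted numbers rather than taken from the text.

```python
def implied_rates(abs_gain_pts: float, rel_gain_pct: float) -> tuple[float, float]:
    """Back out (baseline, improved) success rates in percent.

    Assumes rel_gain_pct = 100 * abs_gain_pts / baseline, which is the
    usual definition of relative improvement (an assumption here).
    """
    baseline = abs_gain_pts / (rel_gain_pct / 100.0)
    improved = baseline + abs_gain_pts
    return baseline, improved


# Figures quoted in the abstract: +9.37 points absolute, 83.6% relative.
baseline, agent_s = implied_rates(9.37, 83.6)
print(f"implied baseline success rate: {baseline:.2f}%")
print(f"implied Agent S success rate:  {agent_s:.2f}%")
```

Because the relative figure is rounded to one decimal place, the recovered rates are only accurate to about two decimal places.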