Autonomous interaction with computers has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific piece of software or a single website, which limits their applicability to general computer tasks. To address this limitation, we introduce OS-Copilot, a framework for building generalist agents capable of interfacing with the full range of elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a benchmark for general AI assistants, FRIDAY outperforms previous methods by 35%, showing strong generalization to unseen applications through skills accumulated from previous tasks. We also present quantitative evidence that FRIDAY learns to control Excel and PowerPoint, and to self-improve on them, with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.