Autonomous interaction with computers has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to operate within a narrow domain, such as a specific software application or website, which constrains their applicability to general computer tasks. To address this, we introduce OS-Copilot, a framework for building generalist agents capable of interfacing with a comprehensive set of elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a benchmark for general AI assistants, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via skills accumulated from previous tasks. We also present numerical and qualitative evidence that FRIDAY learns to control and self-improve on Excel and PowerPoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.