The recently released Claude 3.5 Computer Use stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent. Because it is an early beta, its capability in complex real-world environments remains unknown. In this case study of Claude 3.5 Computer Use, we curate and organize a collection of carefully designed tasks spanning a variety of domains and software. Observations from these cases demonstrate Claude 3.5 Computer Use's unprecedented ability to translate language instructions into desktop actions end to end. Along with this study, we provide an out-of-the-box agent framework that makes it easy to deploy API-based GUI automation models. Our case studies aim to establish a baseline of the capabilities and limitations of Claude 3.5 Computer Use through detailed analyses, and to bring to the fore questions about planning, action, and critic that must be considered for future improvement. We hope this preliminary exploration will inspire future research in the GUI agent community. All the test cases in the paper can be tried through the project: https://github.com/showlab/computer_use_ootb.