Cradle: Empowering Foundation Agents Towards General Computer Control

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu

Despite their success in specific scenarios, existing foundation agents still struggle to generalize across virtual scenarios, mainly because environments are encapsulated in dramatically different ways, each with manually designed observation and action spaces. To address this issue, we propose the General Computer Control (GCC) setting, which restricts foundation agents to interacting with software through the most unified and standardized interface, i.e., screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible framework powered by large multimodal models (LMMs), as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and, after high-level planning, output executable code for low-level keyboard and mouse control, so that it can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate everyday software such as Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks that evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.
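To make the GCC interface concrete, the following is a minimal sketch of a control loop whose only observation is a screenshot and whose only outputs are keyboard and mouse actions. It is not Cradle's implementation and does not model its six modules; every name here (`Screenshot`, `KeyPress`, `MouseClick`, `run_episode`, and the `capture`, `execute`, and `policy` callables) is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Screenshot:
    """The only observation in this sketch: raw screen pixels."""
    width: int
    height: int
    pixels: bytes  # assumed layout: packed RGB bytes


@dataclass(frozen=True)
class KeyPress:
    """Hypothetical low-level keyboard action."""
    key: str
    duration_s: float = 0.1


@dataclass(frozen=True)
class MouseClick:
    """Hypothetical low-level mouse action at screen coordinates."""
    x: int
    y: int
    button: str = "left"


Action = Union[KeyPress, MouseClick]


def run_episode(
    capture: Callable[[], Screenshot],
    execute: Callable[[Action], None],
    policy: Callable[[Screenshot, str], List[Action]],
    task: str,
    max_steps: int = 10,
) -> List[Action]:
    """Observe a screenshot, ask the policy for actions, execute them.

    The loop stops when the policy returns no actions or when
    max_steps observations have been made. No application-specific
    API is used: the environment is reached only through `capture`
    and `execute`.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        shot = capture()
        actions = policy(shot, task)
        if not actions:
            break
        for action in actions:
            execute(action)
            history.append(action)
    return history
```

In a real setup, `capture` and `execute` would wrap an operating-system screenshot utility and an input-injection library, and `policy` would call an LMM; here they are left as injected callables so the loop itself stays independent of any particular software.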
