Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu

Despite their success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize across scenarios, mainly because observations and actions differ dramatically from one scenario to another. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input and producing keyboard and mouse operations as output, similar to human-computer interaction. The main challenges of achieving GCC are: 1) multimodal observations for decision-making, 2) the requirement of accurate keyboard and mouse control, 3) the need for long-term memory and reasoning, and 4) the ability to explore efficiently and self-improve. To target GCC, we introduce Cradle, an agent framework with six main modules: 1) information gathering to extract multimodal information, 2) self-reflection to rethink past experiences, 3) task inference to choose the best next task, 4) skill curation to generate and update relevant skills for given tasks, 5) action planning to generate specific operations for keyboard and mouse control, and 6) memory for storage and retrieval of past experiences and known skills. To demonstrate Cradle's capabilities for generalization and self-improvement, we deploy it in the complex AAA game Red Dead Redemption II, as a preliminary attempt towards GCC with a challenging target. To the best of our knowledge, our work is the first to enable agents based on large multimodal models (LMMs) to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge or resources. The project website is at https://baai-agents.github.io/Cradle/.
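
The six modules listed in the abstract can be pictured as a single decision loop. The following Python sketch is only an illustration under stated assumptions, not Cradle's actual implementation: every class, method, and action encoding here (`Observation`, `Action`, `SketchAgent`, the `"press:w"` string, the `"explore"` task) is hypothetical, and each module is reduced to a stub where a real agent would query a large multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Observation:
    """GCC input: a screen image, plus optional audio (both hypothetical raw bytes)."""
    screenshot: bytes
    audio: bytes = b""


@dataclass
class Action:
    """GCC output: one keyboard or mouse operation (hypothetical string encoding)."""
    device: str   # "keyboard" or "mouse"
    command: str  # e.g. "press:w"


@dataclass
class Memory:
    """Module 6: storage and retrieval of past experiences and known skills."""
    episodes: List[dict] = field(default_factory=list)
    skills: Dict[str, Callable[[], List[Action]]] = field(default_factory=dict)

    def recent(self, k: int = 3) -> List[dict]:
        return self.episodes[-k:]


class SketchAgent:
    """Wires the six modules into one step; all logic below is placeholder."""

    def __init__(self) -> None:
        self.memory = Memory()

    def gather_information(self, obs: Observation) -> dict:
        # Module 1: a real agent would extract text, objects, etc. from the screen.
        return {"screen_bytes": len(obs.screenshot), "has_audio": bool(obs.audio)}

    def reflect(self, info: dict) -> dict:
        # Module 2: look back at the most recent stored experience, if any.
        recent = self.memory.recent(1)
        return {"last_task": recent[-1]["task"] if recent else None}

    def infer_task(self, info: dict, reflection: dict) -> str:
        # Module 3: stub that always picks the same (assumed) task name.
        return "explore"

    def curate_skills(self, task: str) -> None:
        # Module 4: create a skill for the task if none is known yet.
        if task not in self.memory.skills:
            self.memory.skills[task] = lambda: [Action("keyboard", "press:w")]

    def plan_actions(self, task: str) -> List[Action]:
        # Module 5: turn the chosen skill into concrete keyboard/mouse operations.
        return self.memory.skills[task]()

    def step(self, obs: Observation) -> List[Action]:
        info = self.gather_information(obs)
        reflection = self.reflect(info)
        task = self.infer_task(info, reflection)
        self.curate_skills(task)
        actions = self.plan_actions(task)
        # Module 6: record the experience for later reflection.
        self.memory.episodes.append({"info": info, "task": task, "actions": actions})
        return actions
```

Calling `SketchAgent().step(Observation(screenshot=b"..."))` returns a list of `Action` objects and appends one entry to the agent's memory. The point of the sketch is the interface: the agent sees only screen (and optionally audio) data and emits only keyboard and mouse operations, so nothing in the loop is tied to a particular application.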
