We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity, that is, the ability to produce novel and diverse task solutions that are implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and to perform long-horizon planning for such complicated goals. Motivated by the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, in which the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. The imaginator is implemented with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can be either a behavior-cloning policy learned from data or a pre-trained foundation model that generates executable code in the environment. We benchmark creative tasks in the challenging open-world game Minecraft, where agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks that utilize GPT-4V and hold many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that they are the first AI agents to accomplish diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).
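The imaginator-plus-controller decomposition described in the abstract can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the names `CreativeAgent`, `toy_imaginator`, and `toy_controller` are hypothetical and not taken from the Creative-Agents repository, and plain strings stand in for the textual or visual imagination and for environment actions or generated code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a real imagination could be text or an image,
# and a real action could be a low-level control or a line of generated code.
Imagination = str
Action = str


@dataclass
class CreativeAgent:
    """Composes an imaginator with a controller (illustrative only)."""
    imaginator: Callable[[str], Imagination]            # instruction -> imagined outcome
    controller: Callable[[Imagination], List[Action]]   # imagined outcome -> actions

    def act(self, instruction: str) -> List[Action]:
        # Step 1: turn the abstract instruction into a concrete imagined goal.
        imagination = self.imaginator(instruction)
        # Step 2: let the controller plan toward that concrete goal.
        return self.controller(imagination)


def toy_imaginator(instruction: str) -> Imagination:
    # Placeholder for a large language model or a diffusion model.
    return f"imagined outcome of '{instruction}'"


def toy_controller(imagination: Imagination) -> List[Action]:
    # Placeholder for a behavior-cloning policy or a code-generating model.
    return [f"build toward: {imagination}"]


agent = CreativeAgent(imaginator=toy_imaginator, controller=toy_controller)
actions = agent.act("a house with a wooden roof")
```

Keeping the two components behind plain callables reflects the abstract's point that each can be swapped independently: a textual or visual imaginator, paired with either a learned policy or a code-generating model.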