Designing a generalist embodied agent that can follow diverse instructions in human-like ways is a long-standing goal. However, existing approaches often fail to follow instructions steadily because they struggle to understand abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator, with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer builds on recent advances in Multimodal Large Language Models (MLLMs) and diffusion models. It employs a Chain-of-Imagination (CoI) mechanism that envisions the step-by-step process of executing an instruction and translates each imagination into a more precise visual prompt tailored to the current state. The agent then generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instruction at each step. Extensive experiments demonstrate that MineDreamer follows single and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization to, and comprehension of, the open world.
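The control flow implied by the Chain-of-Imagination description can be sketched as a simple loop that alternates between imagining a goal and acting toward it. This is only an illustrative sketch, not MineDreamer's implementation: the function name `run_chain_of_imagination`, the callables `imagine`, `act`, and `step_env`, the string stand-ins for frames and images, and the `reimagine_every` interval are all hypothetical and are not specified in the abstract.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real agent would use image tensors and
# structured keyboard-and-mouse commands instead of strings.
Observation = str   # current game frame
VisualPrompt = str  # imagined goal image for the current state
Action = str        # low-level keyboard-and-mouse action


def run_chain_of_imagination(
    instruction: str,
    first_obs: Observation,
    imagine: Callable[[str, Observation], VisualPrompt],
    act: Callable[[Observation, VisualPrompt], Action],
    step_env: Callable[[Action], Observation],
    max_steps: int,
    reimagine_every: int = 1,
) -> List[Tuple[VisualPrompt, Action]]:
    """Alternate between imagining the next goal and acting toward it.

    `imagine` plays the role of the imagination model (assumed here to map
    an instruction and the current observation to a visual prompt), and
    `act` plays the role of the low-level controller. `reimagine_every`
    is an assumed knob: the abstract does not say how often the
    imagination is refreshed.
    """
    if reimagine_every < 1:
        raise ValueError("reimagine_every must be >= 1")

    obs = first_obs
    prompt: VisualPrompt = ""
    trace: List[Tuple[VisualPrompt, Action]] = []

    for t in range(max_steps):
        # Refresh the imagined goal so it stays tailored to the current state.
        if t % reimagine_every == 0:
            prompt = imagine(instruction, obs)
        # Generate a low-level action that tries to achieve the imagination.
        action = act(obs, prompt)
        trace.append((prompt, action))
        # Advance the environment and observe the new state.
        obs = step_env(action)

    return trace
```

The three callables are injected so that the loop itself stays independent of any particular MLLM, diffusion model, or simulator; swapping in real components would only require matching these assumed signatures.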