AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks. These tasks often involve complex multi-step reasoning and are challenging because paired data connecting human instructions (e.g., making a smiley face) to robot actions (e.g., end-effector movement) is limited. Existing approaches mitigate this challenge by adopting an open-loop paradigm: they decompose high-level instructions into simple sub-task plans and execute them step by step using low-level control models. However, these approaches lack up-to-date observations during multi-step reasoning, which leads to sub-optimal results. To address this issue, we propose to automatically collect a cognitive robot dataset using Large Language Models (LLMs). The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks, each with multi-step text plans and paired observation sequences. To enable efficient data acquisition, we employ elaborate multi-round prompt designs that reduce the need for extensive human involvement. We further propose a closed-loop multi-modal embodied planning model that autoregressively generates plans by taking image observations as input. To facilitate effective learning, we leverage MiniGPT-4 with a frozen visual encoder and LLM, and fine-tune an additional vision adapter and Q-former to enable fine-grained spatial perception for manipulation tasks. Experiments show that our approach outperforms existing open-loop and closed-loop methods, improving the success rate by 21.4% and 14.5% over ChatGPT- and GPT-4-based methods on robot tasks. Real-world demos are available at https://www.youtube.com/watch?v=ayAzID1_qQk
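
To make the open-loop versus closed-loop distinction concrete, the following is a minimal toy sketch, not the AlphaBlock model itself. Everything in it is an assumption made for illustration: `ToyBlockEnv` is a hypothetical stand-in for the robot workspace, dictionary observations stand in for images, the rule-based `next_sub_task` stands in for the multi-modal planner, and a single simulated failed placement (`slip_on_step`) stands in for execution errors.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Position = Tuple[int, int]
SubTask = Tuple[str, Position]  # (block name, target position)


@dataclass
class ToyBlockEnv:
    """Hypothetical workspace: block positions play the role of image observations."""
    positions: Dict[str, Position]
    slip_on_step: Optional[int] = None  # index of the step whose placement silently fails
    steps_taken: int = 0

    def observe(self) -> Dict[str, Position]:
        return dict(self.positions)

    def execute(self, sub_task: SubTask) -> None:
        block, target = sub_task
        if self.steps_taken != self.slip_on_step:
            self.positions[block] = target  # placement succeeds
        self.steps_taken += 1


def goal_reached(goal: Dict[str, Position], obs: Dict[str, Position]) -> bool:
    return all(obs.get(block) == target for block, target in goal.items())


def next_sub_task(goal: Dict[str, Position], obs: Dict[str, Position]) -> Optional[SubTask]:
    """Toy planner: pick the first block that is not yet at its target."""
    for block, target in goal.items():
        if obs.get(block) != target:
            return (block, target)
    return None


def run_open_loop(env: ToyBlockEnv, goal: Dict[str, Position]) -> bool:
    """Plan once from the initial observation, then execute without looking again."""
    obs = env.observe()
    plan = [(b, t) for b, t in goal.items() if obs.get(b) != t]
    for sub_task in plan:
        env.execute(sub_task)
    return goal_reached(goal, env.observe())


def run_closed_loop(env: ToyBlockEnv, goal: Dict[str, Position], max_steps: int = 10) -> bool:
    """Re-observe before every step, so a failed placement is noticed and retried."""
    for _ in range(max_steps):
        sub_task = next_sub_task(goal, env.observe())
        if sub_task is None:
            return True
        env.execute(sub_task)
    return goal_reached(goal, env.observe())


# Hypothetical "smiley face" layout used only for this illustration.
START = {"left_eye": (0, 0), "right_eye": (0, 0), "mouth": (0, 0)}
GOAL = {"left_eye": (1, 3), "right_eye": (3, 3), "mouth": (2, 1)}

if __name__ == "__main__":
    print("open loop:  ", run_open_loop(ToyBlockEnv(dict(START), slip_on_step=1), GOAL))
    print("closed loop:", run_closed_loop(ToyBlockEnv(dict(START), slip_on_step=1), GOAL))
```

In this toy setting, the open-loop runner never notices the failed second placement, while the closed-loop runner sees from the next observation that the block is still misplaced and issues the same sub-task again.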
