Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place to tasks that require intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which posits two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture by classifying tasks according to instruction type and routing each one to the appropriate system. RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated for the current user instruction, and 2) a slow-thinking system comprising a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize user intent or perform reasoning tasks. To assess our method, we built a dataset of real-world trajectories covering actions that range from spontaneous impulses to tasks requiring deliberate contemplation. Our results, in both simulation and real-world scenarios, show that our approach handles complex tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
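
To make the two-system routing concrete, the following is a minimal sketch of how an instruction discriminator could dispatch between a fast path and a slow path. It is an illustration under stated assumptions, not the RFST implementation: the cue-word heuristic, the names `SLOW_CUES`, `discriminate`, `route`, and `Decision`, and the `reasoner` callable standing in for the fine-tuned vision-language model are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical cue words that hint at intent recognition or reasoning.
# The abstract does not say how the discriminator works; a learned
# classifier would likely replace this heuristic.
SLOW_CUES = {"because", "if", "which", "thirsty", "hungry", "largest", "smallest"}


@dataclass
class Decision:
    system: str       # "fast" or "slow"
    instruction: str  # instruction handed to the policy network


def discriminate(instruction: str) -> str:
    """Return "slow" if the instruction contains a reasoning cue, else "fast"."""
    # Match whole words so that "lift" does not trigger the cue "if".
    tokens = set(re.findall(r"[a-z']+", instruction.lower()))
    return "slow" if tokens & SLOW_CUES else "fast"


def route(instruction: str, reasoner: Callable[[str], str]) -> Decision:
    """Send the instruction to the fast or the slow system.

    `reasoner` is a stand-in for the slow-thinking model: it maps an
    implicit instruction to an explicit one the policy can execute.
    """
    system = discriminate(instruction)
    if system == "slow":
        instruction = reasoner(instruction)
    return Decision(system=system, instruction=instruction)
```

In this sketch, a direct command such as "pick up the red block" goes straight to the policy, while "I am thirsty" is first rewritten by the reasoner into an explicit command before execution.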