Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place to tasks that require intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which posits two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture by classifying tasks and routing each decision to one of two systems according to the instruction type. RFST consists of two key components: 1) an instruction discriminator that determines which system to activate for the current user instruction, and 2) a slow-thinking system comprising a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize user intent or perform reasoning tasks. To evaluate our method, we built a dataset of real-world trajectories, covering actions that range from spontaneous impulses to tasks requiring deliberate contemplation. Results in both simulation and real-world scenarios show that our approach handles intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
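The two-system routing described above can be illustrated with a minimal sketch. This is not the RFST implementation: the keyword list, the function names (`discriminate`, `fast_system`, `slow_system`, `act`), and the string-based "commands" are all hypothetical stand-ins. In particular, the abstract does not say how the instruction discriminator is built, and the slow system in RFST is a fine-tuned vision-language model, not a string rewrite.

```python
import re

# Hypothetical cue words that suggest intent recognition or reasoning is needed.
# A keyword list is only a placeholder for the paper's instruction discriminator.
SLOW_CUES = {"thirsty", "hungry", "because", "which", "count", "sort"}


def discriminate(instruction: str) -> str:
    """Return "slow" if the instruction seems to need reasoning, else "fast"."""
    words = re.findall(r"[a-z]+", instruction.lower())
    return "slow" if any(word in SLOW_CUES for word in words) else "fast"


def fast_system(instruction: str) -> str:
    """Stand-in for a policy network that maps an instruction directly to actions."""
    return f"execute: {instruction}"


def slow_system(instruction: str) -> str:
    """Stand-in for a vision-language model that first infers an explicit goal,
    then hands that goal to the policy."""
    inferred_goal = f"goal inferred from '{instruction}'"
    return fast_system(inferred_goal)


def act(instruction: str) -> tuple[str, str]:
    """Route the instruction to one of the two systems and return (system, command)."""
    system = discriminate(instruction)
    command = slow_system(instruction) if system == "slow" else fast_system(instruction)
    return system, command


if __name__ == "__main__":
    for text in ("Pick up the red block", "I am thirsty."):
        print(act(text))
```

In this sketch a direct command such as "Pick up the red block" goes straight to the fast path, while an indirect statement such as "I am thirsty." is sent through the slow path before any action is issued.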