Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach is to train on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities for robot control. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
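The sample-and-reinforce step described above can be sketched at a high level. This is a minimal illustration, not the paper's implementation: `sample_responses` is a hypothetical stand-in for LVLM sampling, the reward is assumed to be (negative) distance between the predicted and expert-demonstrated next keypoint, and the group-relative advantage mirrors the DeepSeek-R1-style recipe of comparing each sampled response against the group mean.

```python
import random
random.seed(0)

def sample_responses(prompt, n=4):
    # Hypothetical stand-in for LVLM sampling: each "response" pairs a
    # reasoning trace with a predicted next-keypoint position (random 2-D
    # points here, purely for illustration).
    return [(f"reasoning trace {i}", (random.uniform(0, 1), random.uniform(0, 1)))
            for i in range(n)]

def reward(pred, target):
    # Assumed reward: negative Euclidean distance, so that keypoint
    # predictions closer to the expert demonstration score higher.
    return -((pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2) ** 0.5

def group_advantages(rewards):
    # Group-relative advantage (DeepSeek-R1 style): each sampled response's
    # reward minus the mean reward of its sampling group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

target = (0.5, 0.5)  # next keypoint state taken from the expert demonstration
responses = sample_responses("scene image + environment metadata")
rewards = [reward(pred, target) for _, pred in responses]
advantages = group_advantages(rewards)

# Responses with positive advantage are the ones a policy-gradient update
# would reinforce; the rest are down-weighted.
reinforced = [trace for (trace, _), adv in zip(responses, advantages) if adv > 0]
```

In an actual RL fine-tuning loop, the advantages would weight the log-likelihood of each sampled response in the policy update, so reasoning traces that end in more accurate keypoint predictions become more likely.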