Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as demonstrations or examples of success and failure, to learn task-specific reward functions. Recently, there has also been growing adoption of large multi-modal foundation models in robotics. These models can perform visual reasoning in physical contexts and generate coarse robot motions for various manipulation tasks. Motivated by this range of capability, in this work we propose and study rewards shaped by vision-language models (VLMs). State-of-the-art VLMs have demonstrated an impressive zero-shot ability to reason about affordances through keypoints, and we leverage this to define dense rewards for robotic learning. On a real-world manipulation task specified by a natural-language description, we find that these rewards improve the sample efficiency of autonomous RL and enable successful completion of the task within 20K online finetuning steps. Additionally, we demonstrate the robustness of the approach to reductions in the number of in-domain demonstrations used for pretraining, reaching comparable performance within 35K online finetuning steps.
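A minimal sketch of how a keypoint-based dense reward might be computed is shown below, assuming the VLM returns pixel coordinates for a grasp point and a goal location in response to a zero-shot query; the exponential distance shaping and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def keypoint_reward(grasp_kp: np.ndarray, goal_kp: np.ndarray,
                    scale: float = 10.0) -> float:
    """Dense reward from the distance between two image keypoints.

    grasp_kp, goal_kp: (2,) pixel coordinates, assumed to come from a
    zero-shot VLM query such as "mark the object's graspable point and
    the goal location". The exponential shaping below is one common
    choice for bounded dense rewards; the paper's form may differ.
    """
    dist = np.linalg.norm(grasp_kp - goal_kp)
    # Map pixel distance to a bounded dense reward in (0, 1].
    return float(np.exp(-dist / scale))

# Example: the reward grows as the predicted grasp point nears the goal.
r_far = keypoint_reward(np.array([40.0, 60.0]), np.array([200.0, 180.0]))
r_near = keypoint_reward(np.array([195.0, 178.0]), np.array([200.0, 180.0]))
assert r_near > r_far
```

Such a per-frame scalar can be fed to any standard RL algorithm as the shaped reward, replacing or supplementing a sparse task-success signal.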