Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own

Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, applying RL algorithms directly in the real world is challenging. First, RL is data-intensive and typically requires millions of interactions with the environment, which is impractical in real scenarios. Second, designing reward functions manually takes heavy engineering effort. To address these issues, we leverage foundation models. We propose Reinforcement Learning with Foundation Priors (RLFP), a framework that uses guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) \textit{sample efficiency}; (2) \textit{minimal and effective reward engineering}; (3) \textit{independence from the form of the foundation models and robustness to noisy priors}. Our method achieves strong performance on a variety of manipulation tasks, both on real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86\% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100\% success rates in 7/8 tasks in fewer than 100k frames (about 1 hour of training), outperforming baseline methods that use manually designed rewards and 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world on a wider range of tasks. Visualizations and code are available at https://yewr.github.io/rlfp.
