Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, applying RL algorithms directly in the real world is challenging. For one thing, RL is data-intensive and typically requires millions of interactions with the environment, which is impractical in real scenarios. For another, manually designing reward functions demands heavy engineering effort. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) \textit{sample efficiency}; (2) \textit{minimal and effective reward engineering}; (3) \textit{agnosticism to foundation model forms and robustness to noisy priors}. Our method achieves remarkable performance on various manipulation tasks, both on real robots and in simulation. Across 5 dexterous tasks on real robots, FAC achieves an average success rate of 86\% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100\% success rates on 7/8 tasks within 100k frames (about 1 hour of training), outperforming baseline methods with manually designed rewards trained for 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world across a broader range of tasks.