Reinforcement learning (RL) has demonstrated its capability in solving various tasks but is notorious for its low sample efficiency. In this paper, we propose RLingua, a framework that leverages the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulation. To this end, we first show how prompt engineering can extract the prior knowledge of LLMs to generate a preliminary rule-based robot controller for a specific task. Although imperfect, the LLM-generated controller is used to produce action samples during rollouts with a decaying probability, thereby improving RL's sample efficiency. We employ the actor-critic framework and modify the actor loss to regularize policy learning towards the LLM-generated controller. RLingua also provides a novel method for improving imperfect LLM-generated robot controllers through RL. We demonstrate that RLingua significantly reduces the sample complexity of TD3 on the robot tasks of panda_gym and achieves high success rates on sparsely rewarded robot tasks in RLBench, where standard TD3 fails. Additionally, we validate RLingua's effectiveness in real-world robot experiments through Sim2Real, demonstrating that the learned policies transfer effectively to real robot tasks. Further details and videos about our work are available at our project website https://rlingua.github.io.
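The two mechanisms the abstract describes, mixing LLM-controller actions into rollouts with a decaying probability and regularizing the actor toward that controller, can be illustrated with a small sketch. The abstract does not specify the exact schedule or penalty, so the following are assumptions: an exponential decay `p_init * decay**step`, a squared-error pull toward the LLM action added to the standard TD3 actor loss `-mean(Q(s, pi(s)))`, and the hypothetical helper names and `reg_weight` coefficient. A real implementation would compute the loss on autodiff tensors (e.g., in PyTorch) so gradients reach the actor; plain floats are used here only to show the arithmetic.

```python
import random
from typing import Sequence


def llm_sampling_probability(step: int, p_init: float = 1.0, decay: float = 0.999) -> float:
    """Probability of executing the LLM controller's action at a given step.

    Hypothetical exponential schedule; RLingua's actual schedule may differ.
    """
    return p_init * decay ** step


def choose_rollout_action(policy_action, llm_action, p_llm: float, rng: random.Random):
    """With probability p_llm take the LLM-generated controller's action, else the RL policy's."""
    return llm_action if rng.random() < p_llm else policy_action


def regularized_actor_loss(
    q_values: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    llm_actions: Sequence[Sequence[float]],
    reg_weight: float = 0.1,
) -> float:
    """TD3-style actor loss -mean(Q) plus an assumed squared-error term toward the LLM action.

    q_values[i] is Q(s_i, pi(s_i)), policy_actions[i] is pi(s_i), and llm_actions[i] is the
    LLM controller's action in s_i. reg_weight is a hypothetical coefficient.
    """
    n = len(q_values)
    if n == 0 or not (n == len(policy_actions) == len(llm_actions)):
        raise ValueError("batch sizes must match and be non-empty")
    # Standard TD3 actor objective: maximize Q, i.e. minimize its negative batch mean.
    q_term = -sum(q_values) / n
    # Mean squared distance between the policy's actions and the LLM controller's actions.
    reg_term = sum(
        sum((a - b) ** 2 for a, b in zip(pa, la))
        for pa, la in zip(policy_actions, llm_actions)
    ) / n
    return q_term + reg_weight * reg_term


if __name__ == "__main__":
    rng = random.Random(0)
    p = llm_sampling_probability(step=1000)  # about 0.37 with the default decay
    action = choose_rollout_action([0.0, 0.0], [0.5, -0.2], p, rng)
    loss = regularized_actor_loss([1.0, 3.0], [[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]], 0.5)
    print(p, action, loss)
```

Early in training the LLM controller dominates the rollouts, supplying informative transitions in sparse-reward tasks; as the probability decays, the learned policy takes over, while the regularization term keeps it near the controller's behavior. A larger `reg_weight` trusts the imperfect controller more; a smaller one lets the critic correct it.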