Reinforcement learning (RL) has demonstrated its capability in solving various tasks but is notorious for its low sample efficiency. In this paper, we propose RLingua, a framework that leverages the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulation. To this end, we first present a prompt-engineering method for extracting the prior knowledge of LLMs, so that a preliminary rule-based robot controller for a specific task can be generated in a user-friendly manner. Although imperfect, the LLM-generated robot controller is used to produce action samples during rollouts with a decaying probability, thereby improving RL's sample efficiency. We employ TD3, a widely used RL baseline, and modify its actor loss to regularize policy learning toward the LLM-generated controller. RLingua also provides a novel method for improving imperfect LLM-generated robot controllers through RL. We demonstrate that RLingua significantly reduces the sample complexity of TD3 on four robot tasks from panda_gym and achieves high success rates on 12 sampled sparse-reward robot tasks from RLBench, where standard TD3 fails. Additionally, we validate RLingua's effectiveness in real-world robot experiments via Sim2Real, showing that the learned policies transfer effectively to real robot tasks. Further details are available on our project website https://rlingua.github.io.
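The two mechanisms described above (sampling rollout actions from the LLM-generated controller with a decaying probability, and adding a regularizer to the TD3 actor loss that pulls the policy toward that controller) can be illustrated with a minimal sketch. It is not the authors' implementation. The exponential decay schedule, the squared-error form of the regularizer, and every name here (`mixing_probability`, `select_action`, `regularized_actor_loss`, `reg_weight`) are hypothetical assumptions chosen for illustration. Plain Python floats stand in for the tensors and autodiff a real TD3 agent would use.

```python
import math
import random
from typing import Callable, Sequence

# Hypothetical types: observations and actions are plain float vectors here.
Obs = Sequence[float]
Action = Sequence[float]
Controller = Callable[[Obs], Action]


def mixing_probability(step: int, p0: float = 1.0, decay: float = 1e-4,
                       p_min: float = 0.0) -> float:
    """Assumed exponential schedule for how often the LLM controller acts."""
    return max(p_min, p0 * math.exp(-decay * step))


def select_action(obs: Obs, step: int, llm_controller: Controller,
                  rl_policy: Controller, rng: random.Random) -> Action:
    """With probability mixing_probability(step), act with the LLM-generated
    controller; otherwise act with the RL policy (exploration noise omitted)."""
    if rng.random() < mixing_probability(step):
        return llm_controller(obs)
    return rl_policy(obs)


def regularized_actor_loss(q_values: Sequence[float],
                           policy_actions: Sequence[Action],
                           llm_actions: Sequence[Action],
                           reg_weight: float) -> float:
    """Batch mean of -Q1(s, pi(s)) + reg_weight * ||pi(s) - a_llm(s)||^2.

    q_values are critic estimates Q1(s, pi(s)), as in the standard TD3 actor
    update; the squared-error pull toward the LLM controller's action is an
    assumed form of the regularizer, not necessarily the paper's exact one.
    """
    n = len(q_values)
    q_term = -sum(q_values) / n
    reg_term = sum(
        sum((a - b) ** 2 for a, b in zip(pa, la))
        for pa, la in zip(policy_actions, llm_actions)
    ) / n
    return q_term + reg_weight * reg_term
```

For example, `regularized_actor_loss([1.0, 3.0], [[0.0], [1.0]], [[0.0], [0.0]], 0.5)` evaluates to `-2.0 + 0.5 * 0.5 = -1.75`. Early in training the LLM controller supplies most rollout actions. As `mixing_probability` decays, the RL policy takes over, while the regularizer keeps it from drifting far from the controller's prior. How (or whether) `reg_weight` should itself decay is left as a further assumption.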