In recent years, significant progress has been made in the field of robotic reinforcement learning (RL), enabling methods that handle complex image observations, train in the real world, and incorporate auxiliary data, such as demonstrations and prior experience. However, despite these advances, robotic RL remains hard to use. Practitioners widely acknowledge that the implementation details of these algorithms are often just as important for performance as the choice of algorithm, if not more so. We posit that a significant challenge to the widespread adoption of robotic RL, as well as to the further development of robotic RL methods, is the comparative inaccessibility of such methods. To address this challenge, we developed a carefully implemented library containing a sample-efficient off-policy deep RL method, together with methods for computing rewards and resetting the environment, a high-quality controller for a widely adopted robot, and a number of challenging example tasks. We provide this library as a resource for the community, describe its design choices, and present experimental results. Perhaps surprisingly, we find that our implementation can achieve very efficient learning, acquiring policies for printed circuit board (PCB) assembly, cable routing, and object relocation in 25 to 50 minutes of training per policy on average, improving over state-of-the-art results reported for similar tasks in the literature. These policies achieve perfect or near-perfect success rates, remain extremely robust even under perturbations, and exhibit emergent recovery and correction behaviors. We hope that these promising results and our high-quality open-source implementation will provide a tool for the robotics community to facilitate further developments in robotic RL. Our code, documentation, and videos can be found at https://serl-robot.github.io/