In recent years, significant progress has been made in the field of robotic reinforcement learning (RL), enabling methods that handle complex image observations, train in the real world, and incorporate auxiliary data, such as demonstrations and prior experience. However, despite these advances, robotic RL remains hard to use. It is acknowledged among practitioners that the particular implementation details of these algorithms are often just as important (if not more so) for performance as the choice of algorithm. We posit that a significant challenge to widespread adoption of robotic RL, as well as further development of robotic RL methods, is the comparative inaccessibility of such methods. To address this challenge, we developed a carefully implemented library containing a sample-efficient off-policy deep RL method, together with methods for computing rewards and resetting the environment, a high-quality controller for a widely adopted robot, and a number of challenging example tasks. We provide this library as a resource for the community, describe its design choices, and present experimental results. Perhaps surprisingly, we find that our implementation can achieve very efficient learning, acquiring policies for PCB board assembly, cable routing, and object relocation in 25 to 50 minutes of training per policy on average, improving over state-of-the-art results reported for similar tasks in the literature. These policies achieve perfect or near-perfect success rates, are extremely robust even under perturbations, and exhibit emergent recovery and correction behaviors. We hope that these promising results and our high-quality open-source implementation will provide a tool for the robotics community to facilitate further developments in robotic RL. Our code, documentation, and videos can be found at https://serl-robot.github.io/