In recent years, significant progress has been made in robotic reinforcement learning (RL), enabling methods that handle complex image observations, train in the real world, and incorporate auxiliary data such as demonstrations and prior experience. Despite these advances, robotic RL remains hard to use. Practitioners widely acknowledge that the implementation details of these algorithms are often just as important for performance as the choice of algorithm, if not more so. We posit that a significant challenge to widespread adoption of robotic RL, as well as to further development of robotic RL methods, is the comparative inaccessibility of such methods. To address this challenge, we developed a carefully implemented library containing a sample-efficient off-policy deep RL method, together with methods for computing rewards and resetting the environment, a high-quality controller for a widely adopted robot, and a number of challenging example tasks. We provide this library as a resource for the community, describe its design choices, and present experimental results. Perhaps surprisingly, we find that our implementation can achieve very efficient learning, acquiring policies for printed circuit board (PCB) assembly, cable routing, and object relocation in 25 to 50 minutes of training per policy on average, improving over state-of-the-art results reported for similar tasks in the literature. These policies achieve perfect or near-perfect success rates, remain extremely robust even under perturbations, and exhibit emergent recovery and correction behaviors. We hope that these promising results and our high-quality open-source implementation will provide a tool for the robotics community to facilitate further developments in robotic RL. Our code, documentation, and videos can be found at https://serl-robot.github.io/