Solving complex real-world tasks using reinforcement learning (RL) without high-fidelity simulation environments or large amounts of offline data can be quite challenging. Online RL agents trained in imperfect simulation environments can suffer from severe sim-to-real issues. Offline RL approaches, although they bypass the need for simulators, often impose demanding requirements on the size and quality of the offline datasets. The recently emerged hybrid offline-and-online RL paradigm provides an attractive framework that enables the joint use of limited offline data and an imperfect simulator for transferable policy learning. In this paper, we develop a new algorithm, called H2O+, which offers great flexibility to bridge various choices of offline and online learning methods, while also accounting for dynamics gaps between the real and simulated environments. Through extensive simulation and real-world robotics experiments, we demonstrate superior performance and flexibility over advanced cross-domain online and offline RL algorithms.