Sparse rewards pose a significant challenge to achieving high sample efficiency in goal-conditioned reinforcement learning (RL). In sequential manipulation tasks in particular, the agent receives only a failure reward until it completes the entire manipulation task, which leads to low sample efficiency. To tackle this issue, we propose a novel model-based RL framework called Model-based Relay Hindsight Experience Replay (MRHER). MRHER breaks a continuous task down into subtasks of increasing complexity and uses each subtask to guide the learning of the next. Instead of applying Hindsight Experience Replay (HER) to every subtask, we design a new, robust model-based relabeling method called Foresight Relabeling (FR). FR predicts the future trajectory of the hindsight state and relabels the desired goal with a goal achieved on this virtual future trajectory. By incorporating FR, MRHER extracts more information from historical experience, which improves sample efficiency, particularly in object-manipulation environments. Experimental results demonstrate that MRHER achieves state-of-the-art sample efficiency on benchmark tasks, outperforming RHER by 13.79% and 14.29% in the FetchPush-v1 and FetchPickAndPlace-v1 environments, respectively.
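As a rough illustration of the Foresight Relabeling step summarized above, the minimal sketch below rolls out a short virtual trajectory from a hindsight state with a learned one-step dynamics model and then samples an achieved goal from that trajectory to use as the relabeled goal. The names `policy`, `dynamics_model`, `goal_dim`, and `horizon` are illustrative assumptions for this sketch, not the interface used in the paper.

```python
import numpy as np

def foresight_relabel(state, goal, policy, dynamics_model,
                      goal_dim=3, horizon=5, rng=None):
    """Sketch of Foresight Relabeling (FR) under assumed interfaces.

    Assumptions (not from the paper): `policy(state, goal)` returns an
    action, `dynamics_model(state, action)` returns the predicted next
    state, and the achieved goal is the first `goal_dim` entries of a state.
    """
    rng = rng or np.random.default_rng()

    # Roll out a virtual future trajectory from the hindsight state
    # using the learned dynamics model and the current policy.
    virtual_states = []
    s = state
    for _ in range(horizon):
        a = policy(s, goal)        # action proposed by the current policy
        s = dynamics_model(s, a)   # one-step model prediction
        virtual_states.append(s)

    # Relabel: return the achieved goal of a state sampled from the
    # virtual (model-predicted) future trajectory.
    future_state = virtual_states[rng.integers(len(virtual_states))]
    return future_state[:goal_dim]


if __name__ == "__main__":
    # Toy usage with random stand-ins for the policy and dynamics model.
    state_dim, action_dim, goal_dim = 10, 4, 3
    policy = lambda s, g: np.tanh(np.ones(action_dim))
    dynamics_model = lambda s, a: s + 0.01 * np.ones(state_dim)
    new_goal = foresight_relabel(np.zeros(state_dim), np.ones(goal_dim),
                                 policy, dynamics_model, goal_dim=goal_dim)
    print("relabeled goal:", new_goal)
```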