Deep reinforcement learning (DRL) is difficult to master in tasks with sparse rewards. Such rewards only indicate whether the task is partially or fully completed, so the agent must take many exploratory actions before receiving any meaningful feedback. As a result, most existing DRL exploration algorithms fail to learn practical policies within a reasonable time. To address this challenge, we propose an approach that leverages offline demonstration trajectories for faster and more efficient online RL in sparse-reward environments. Our key insight is to treat the offline demonstration trajectories as guidance rather than targets for imitation, so that our method learns a policy whose marginal state-action visitation distribution matches that of the offline demonstrations. Specifically, we introduce a novel trajectory distance based on maximum mean discrepancy (MMD) and formulate policy optimization as a distance-constrained optimization problem. We then show that this problem can be reduced to a policy-gradient algorithm with rewards shaped by the offline demonstrations. We evaluate the proposed algorithm on extensive discrete and continuous control tasks with sparse and deceptive rewards. The experimental results show that our algorithm significantly outperforms baseline methods in terms of diverse exploration and learning an optimal policy.
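To make the MMD-based trajectory distance concrete, the sketch below estimates the squared MMD between state-action samples drawn from an agent trajectory and from a demonstration trajectory, using a Gaussian kernel. This is a minimal illustration under assumed choices (Gaussian kernel, biased V-statistic estimator, concatenated state-action vectors); the paper's exact kernel, estimator, and how the distance enters the constrained objective are not specified in this abstract.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel values between rows of a and rows of b.
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Squared MMD between two sample sets (rows are samples), biased estimator."""
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

# Hypothetical usage: compare (state, action) pairs from an agent rollout
# against those from an offline demonstration trajectory.
agent_sa = np.random.randn(100, 6)  # placeholder state-action samples, dim 6
demo_sa = np.random.randn(80, 6)
print(mmd2(agent_sa, demo_sa, bandwidth=1.0))
```

In a distance-constrained setup of the kind described above, such a quantity could serve as the constraint that keeps the learned policy's visitation distribution close to the demonstrations while the sparse task reward is optimized.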