The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have used offline demonstrations to achieve impressive results on multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (demonstrations that contain states but no actions) to indirectly perform approximate but feasible long-term credit assignment and to facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism that scores the quality of the current trajectory against the demonstrations. We then introduce a guidance reward computation technique, based on trajectory importance, to measure the impact of each state-action pair. We theoretically analyze the performance improvement brought by smooth guidance rewards and derive a new worst-case lower bound on that improvement. Extensive experiments demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments: a grid-world maze, Hopper-v4, HalfCheetah-v4, and an Ant maze. We also report specific metrics and quantitative results to support these comparisons.
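The two steps named in the abstract, trajectory-importance evaluation and guidance reward computation, can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names `trajectory_importance` and `guidance_rewards`, the nearest-state Euclidean distance, the exponential kernel with a `temperature` parameter, and the uniform spreading of the trajectory score over time steps. It should not be read as the authors' method.

```python
import numpy as np


def trajectory_importance(traj_states, demo_states, temperature=1.0):
    """Hypothetical trajectory score in (0, 1].

    Assumption: importance decays exponentially with the mean distance
    from each visited state to its nearest demonstration state.
    """
    traj = np.asarray(traj_states, dtype=float)  # shape (T, d)
    demo = np.asarray(demo_states, dtype=float)  # shape (D, d)
    # Pairwise Euclidean distances between visited and demo states: (T, D).
    dists = np.linalg.norm(traj[:, None, :] - demo[None, :, :], axis=-1)
    mean_nearest = dists.min(axis=1).mean()
    return float(np.exp(-mean_nearest / temperature))


def guidance_rewards(traj_states, demo_states, scale=1.0, temperature=1.0):
    """Hypothetical per-step guidance rewards for one trajectory.

    Assumption: the trajectory-level score is shared equally by every
    time step, so each state-action pair receives the same small bonus.
    """
    weight = trajectory_importance(traj_states, demo_states, temperature)
    num_steps = len(traj_states)
    return np.full(num_steps, scale * weight / num_steps)


if __name__ == "__main__":
    demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # state-only demo
    near = demo + 0.1   # trajectory that stays close to the demonstration
    far = demo + 5.0    # trajectory that drifts away from it
    print(trajectory_importance(near, demo), trajectory_importance(far, demo))
    print(guidance_rewards(near, demo))
```

In a sparse-reward setting, such per-step bonuses would typically be added to the environment rewards before the policy update. Because only states are compared, no demonstrated actions are needed.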