We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, \emph{$\epsilon t$-greedy}, which generates exploratory options that steer the agent toward less-visited states. We prove that search using $\epsilon t$-greedy has polynomial sample complexity under mild MDP assumptions. To use the information provided by rewarded transitions more efficiently, we develop a new dual experience replay buffer framework, \emph{GDRB}, and implement \emph{longest $n$-step returns}. The resulting algorithm, \emph{ETGL-DDPG}, integrates all three techniques into DDPG: $\bm{\epsilon t}$-greedy, \textbf{G}DRB, and \textbf{L}ongest $n$-step. We evaluate ETGL-DDPG on standard benchmarks and show that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further show how each technique individually improves the performance of DDPG in this setting.
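The exploration idea behind $\epsilon t$-greedy can be sketched as follows: with probability $\epsilon$ the agent follows an exploratory option aimed at a less-visited state, and otherwise it acts greedily. This is a minimal illustrative sketch only; the names (\texttt{visit\_counts}, \texttt{greedy\_action}, \texttt{candidate\_states}) and the count-based target selection are assumptions, not the paper's actual implementation.

```python
import random


def et_greedy_step(state, epsilon, visit_counts, greedy_action, candidate_states):
    """One action-selection step of an epsilon-t-greedy-style rule (sketch).

    With probability epsilon, start an exploratory "option" whose target is
    the least-visited candidate state; otherwise exploit the greedy policy.
    All arguments beyond `state` and `epsilon` are hypothetical helpers.
    """
    if random.random() < epsilon:
        # Exploratory branch: pick the least-visited candidate as the
        # option's target state (count-based novelty heuristic, assumed).
        target = min(candidate_states, key=lambda s: visit_counts.get(s, 0))
        return ("explore", target)
    # Exploitation branch: act according to the current greedy policy.
    return ("exploit", greedy_action(state))
```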