We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, \emph{$\epsilon_t$-greedy}, which generates exploratory options that steer the agent toward less-visited states. We prove that search using $\epsilon_t$-greedy has polynomial sample complexity under mild MDP assumptions. To use the information provided by rewarded transitions more efficiently, we develop a new dual experience replay buffer framework, \emph{GDRB}, and implement \emph{longest $n$-step returns}. The resulting algorithm, \emph{ETGL-DDPG}, integrates all three techniques: \bm{$\epsilon_t$}-greedy, \textbf{G}DRB, and \textbf{L}ongest $n$-step returns, into DDPG. We evaluate ETGL-DDPG on standard benchmarks and show that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further show how each technique individually improves the performance of DDPG in this setting.
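The exploration scheme described above can be sketched at a high level: with some probability the agent follows a search-generated exploratory option toward less-visited states, and otherwise acts greedily. This is a minimal illustrative sketch only; the names \verb|greedy_policy| and \verb|exploratory_option|, and the fixed probability, are hypothetical stand-ins, not the paper's actual procedure.

```python
import random

def epsilon_t_greedy_action(state, greedy_policy, exploratory_option, epsilon=0.1):
    """Sketch of epsilon_t-greedy selection (illustrative, not the paper's exact method):
    with probability epsilon, follow an exploratory option that targets
    less-visited states; otherwise act greedily."""
    if random.random() < epsilon:
        return exploratory_option(state)   # search-generated exploratory behavior
    return greedy_policy(state)            # exploit the current policy
```

Setting \verb|epsilon| to 0 recovers purely greedy behavior; setting it to 1 always follows the exploratory option.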