Recent advances in reinforcement learning (RL) have predominantly focused on learning step-based policies that generate an action for each perceived state. While these methods efficiently leverage step-level information from environmental interaction, they often ignore the temporal correlation between actions, resulting in inefficient exploration and non-smooth trajectories that are challenging to deploy on real hardware. Episodic RL (ERL) seeks to overcome these challenges by exploring in a parameter space that captures the correlation between actions. However, these approaches typically compromise data efficiency, as they treat trajectories as opaque \emph{black boxes}. In this work, we introduce a novel ERL algorithm, Temporally-Correlated Episodic RL (TCE), which effectively utilizes step information in episodic policy updates, opening the \emph{black box} of existing ERL methods while retaining smooth and consistent exploration in parameter space. TCE synergistically combines the advantages of step-based and episodic RL, achieving performance comparable to recent ERL methods while maintaining data efficiency akin to state-of-the-art (SoTA) step-based RL.