Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. Traditionally, TD learning relies on the value of a single subsequent state. We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states. Building on this new MSTD concept, we develop complete actor-critic algorithms that include replay buffer management in two modes, and integrate with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Experimental results demonstrate that algorithms employing the MSTD target significantly improve learning performance compared to traditional methods. The code is provided on GitHub.
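As an illustration, here is a minimal sketch of one plausible MSTD-style target: it averages the k-step bootstrapped targets over the N subsequent states, each combining discounted immediate rewards with a critic value estimate. The function name, signature, and uniform averaging are assumptions for illustration; the paper's exact aggregation and weighting may differ.

```python
import numpy as np

def mstd_target(rewards, next_values, gamma=0.99):
    """Illustrative multi-state TD (MSTD) target (hypothetical form).

    rewards:     r_1 ... r_N observed along a trajectory segment
    next_values: V(s_1) ... V(s_N), critic estimates of the N subsequent states

    Averages the k-step bootstrapped targets
        G_k = sum_{i=1}^{k} gamma^{i-1} r_i + gamma^k V(s_k)
    over k = 1..N; with N = 1 this reduces to the standard TD target.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)             # [1, gamma, gamma^2, ...]
    cum_rewards = np.cumsum(discounts * rewards)  # discounted reward prefix sums
    targets = cum_rewards + gamma ** np.arange(1, n + 1) * next_values
    return targets.mean()
```

With a single subsequent state this recovers the usual one-step target r + gamma * V(s'), which is consistent with the abstract's claim that MSTD generalizes the traditional TD target.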