Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, yielding discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit approach to policy constraints through Greedy Action Guidance (GAG), which backtracks historical actions and thereby strengthens the policy update process. Our method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action-value estimates and on learning a fast-adapting policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We also find that the early-stage training conservatism of IRA can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA significantly improves the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks.
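To make the $k$-nearest-neighbor action-value estimation idea concrete, the sketch below averages the stored values of the $k$ state-action pairs closest to a query pair. This is a minimal sketch under our own assumptions (Euclidean distance over concatenated state-action vectors, a NumPy buffer of scalar values); the names `knn_action_value`, `buffer_sa`, and `buffer_q` are illustrative placeholders, not IRA's actual implementation.

```python
import numpy as np

def knn_action_value(query_sa: np.ndarray,
                     buffer_sa: np.ndarray,
                     buffer_q: np.ndarray,
                     k: int = 5) -> float:
    """Illustrative k-NN action-value estimate (not IRA's exact formulation):
    average the stored values of the k state-action pairs in the buffer
    that are closest, in Euclidean distance, to the query pair."""
    dists = np.linalg.norm(buffer_sa - query_sa, axis=1)  # distance to every stored (s, a)
    nearest = np.argpartition(dists, k)[:k]               # indices of the k nearest pairs
    return float(buffer_q[nearest].mean())                # averaged neighbor values

# Usage: 1000 stored state-action pairs of dimension 20 with scalar values.
rng = np.random.default_rng(0)
buffer_sa = rng.normal(size=(1000, 20))
buffer_q = rng.normal(size=1000)
print(knn_action_value(buffer_sa[0], buffer_sa, buffer_q, k=5))
```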