Experience replay enables agents to reuse their experiences multiple times. In prior work, the sampling probability of a transition is adjusted according to its relative importance. Because reassigning sampling probabilities to every transition in the replay buffer after each iteration is prohibitively expensive, prioritized experience replay algorithms instead reassess a transition's importance only when it is sampled. However, the relative importance of transitions shifts dynamically as the agent's policy and value function are iteratively updated. Moreover, experience replay retains transitions generated by the agent's past policies, which may diverge substantially from its most recent policy; the larger this divergence, the more off-policy the updates become, degrading the agent's performance. In this paper, we develop a novel algorithm, Corrected Uniform Experience Replay (CUER), which samples stored experiences stochastically while maintaining fairness among all experiences and accounting for the dynamic nature of transition importance, thereby making the sampled state distribution more on-policy. CUER provides promising improvements for off-policy continuous control algorithms in terms of sample efficiency, final performance, and policy stability during training.
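The two replay schemes the abstract contrasts can be sketched as follows. This is an illustrative sketch under common conventions (priorities proportional to |TD error|), not the paper's CUER implementation; all class and method names are hypothetical.

```python
import random

class UniformReplayBuffer:
    """Uniform experience replay: each stored transition is equally
    likely to be sampled, regardless of age or importance."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0  # next write index; oldest transition is overwritten

    def push(self, transition):
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)


class PrioritizedReplayBuffer(UniformReplayBuffer):
    """Prioritized experience replay: sampling probability is proportional
    to a stored priority (e.g. |TD error|). For efficiency, priorities are
    refreshed only for the transitions that were just sampled, so the rest
    of the buffer carries stale importance estimates -- the issue the
    abstract points out."""

    def __init__(self, capacity):
        super().__init__(capacity)
        self.priorities = []

    def push(self, transition, priority=1.0):
        if len(self.buffer) < self.capacity:
            self.priorities.append(priority)
        else:
            self.priorities[self.pos] = priority  # same slot the base class overwrites
        super().push(transition)

    def sample(self, batch_size):
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.buffer)), weights=weights, k=batch_size)
        return idxs, [self.buffer[i] for i in idxs]

    def update_priorities(self, idxs, new_priorities):
        # Called after a learning step: only the sampled transitions
        # get fresh priorities; all others keep their old values.
        for i, p in zip(idxs, new_priorities):
            self.priorities[i] = p
```

A typical loop pushes each environment transition, samples a minibatch for the update, and (in the prioritized variant) writes back the new |TD errors| for only that minibatch.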