In reinforcement learning (RL) with experience replay, the experiences stored in a replay buffer influence the RL agent's performance. Knowing how each experience influences performance is valuable for many purposes, such as identifying experiences that negatively affect underperforming agents. One method for estimating the influence of experiences is leave-one-out (LOO), but it is usually computationally prohibitive. In this paper, we present Policy Iteration with Turn-over Dropout (PIToD), which efficiently estimates the influence of experiences. We evaluate how accurately PIToD estimates the influence of experiences and how efficient it is compared to LOO. We then apply PIToD to amend underperforming RL agents: we use PIToD to identify experiences that negatively influence the agents and to delete the influence of those experiences. We show that the agents' performance improves significantly after these amendments.
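To make the turn-over-dropout idea behind PIToD concrete, the sketch below shows one way influence estimation with per-experience dropout masks could look. This is a minimal illustration under assumed details, not the paper's implementation: the single-hidden-layer network, the names `assign_mask`, `masked_forward`, and `influence_estimate`, and the mask width are all hypothetical. The key mechanism is that each experience is deterministically assigned a binary mask; evaluating with the flipped mask uses only the parameters that experience did not train, so the loss difference between the flipped and original masks serves as an influence estimate without retraining (unlike LOO).

```python
import numpy as np

def assign_mask(experience_id, width, p=0.5, seed=0):
    # Deterministically assign a fixed binary dropout mask to each
    # experience, so the same experience always drops the same units.
    g = np.random.default_rng(seed + experience_id)
    return (g.random(width) < p).astype(np.float32)

def masked_forward(params, x, mask):
    # Hypothetical one-hidden-layer network; the mask zeroes out the
    # hidden units that this experience is NOT allowed to train/use.
    h = np.tanh(x @ params["W1"]) * mask
    return h @ params["W2"]

def influence_estimate(params, x, y, mask, loss_fn):
    # Influence proxy: loss using the flipped mask (parameters never
    # updated by this experience) minus loss using its own mask.
    # A positive value suggests the experience helped; a negative
    # value suggests it hurt.
    loss_with = loss_fn(masked_forward(params, x, mask), y)
    loss_without = loss_fn(masked_forward(params, x, 1.0 - mask), y)
    return loss_without - loss_with
```

During training, each experience's gradient would be applied only through its own mask; "deleting" a harmful experience's influence then amounts to evaluating and acting with the flipped mask, which is what makes the estimate cheap compared to LOO retraining.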