In reinforcement learning (RL) with experience replay, the experiences stored in the replay buffer influence the RL agent's performance. Information about this influence is valuable for various purposes, including experience cleansing and analysis. One method for estimating the influence of individual experiences is agent comparison, but it is prohibitively expensive when the number of experiences is large. In this paper, we present PI+ToD, a policy iteration method that efficiently estimates the influence of experiences by utilizing turn-over dropout. We demonstrate the efficiency of PI+ToD with experiments in MuJoCo environments.
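The abstract does not spell out how turn-over dropout can yield influence estimates, so the following is a minimal sketch of the general idea under stated assumptions, not the PI+ToD implementation. It assumes a toy one-parameter-per-unit regression model in place of a policy and Q-function, and the helper names (`make_mask`, `flip`, `predict`, `train`, `influence`) are hypothetical. Each experience gets a fixed binary mask. Training on that experience only updates the units its mask switches on, so the complementary (flipped) mask selects a sub-model that never learned from it. Comparing the two sub-models on a query approximates the experience's influence without retraining.

```python
import random

WIDTH = 8  # number of units in the toy model (illustrative assumption)


def make_mask(experience_id, width=WIDTH, seed=0):
    """Fixed binary mask for one experience: exactly half of the units are on."""
    rng = random.Random(seed * 1_000_003 + experience_id)
    on = set(rng.sample(range(width), width // 2))
    return [1 if j in on else 0 for j in range(width)]


def flip(mask):
    """Complementary mask: the units that were never updated by this experience."""
    return [1 - m for m in mask]


def predict(weights, x, mask):
    """Toy masked model: x times the mean of the active weights."""
    active = sum(mask)
    return x * sum(m * w for m, w in zip(mask, weights)) / active


def train(weights, data, epochs=50, lr=0.05):
    """SGD on squared error; experience i only updates units where its mask is 1."""
    weights = list(weights)
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            mask = make_mask(i, len(weights))
            err = predict(weights, x, mask) - y
            active = sum(mask)
            for j, m in enumerate(mask):
                if m:
                    # d/dw_j of (pred - y)^2 for an active unit
                    weights[j] -= lr * 2.0 * err * x / active
    return weights


def influence(weights, i, query):
    """Query loss under the flipped mask minus query loss under the original mask."""
    x, y = query
    mask = make_mask(i, len(weights))
    loss_without = (predict(weights, x, flip(mask)) - y) ** 2
    loss_with = (predict(weights, x, mask) - y) ** 2
    return loss_without - loss_with


# Toy "replay buffer" of (input, target) pairs; the last entry is an outlier.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (1.0, -5.0)]
weights = train([0.0] * WIDTH, data)
scores = [influence(weights, i, query=(2.0, 4.0)) for i in range(len(data))]
```

Under this sign convention, a positive score means the query loss is higher for the sub-model that never saw the experience, and a negative score means that sub-model fits the query better. A single training run produces scores for every experience, which is what makes this kind of estimate cheaper than retraining one agent per experience.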