Most reinforcement learning algorithms use an experience replay buffer to train repeatedly on samples the agent has observed in the past. Not all samples are equally significant, and simply assigning equal importance to every sample is a naïve strategy. In this paper, we propose a method that prioritizes samples by how much can be learned from them. We define the learn-ability of a sample as the steady decrease over time of the training loss associated with that sample. We develop an algorithm that prioritizes samples with high learn-ability while assigning lower priority to samples that are hard to learn, typically because of noise or stochasticity. We show empirically that our method is more robust than random sampling and also outperforms prioritizing by the training loss alone, i.e., the temporal difference loss, which is the criterion used in prioritized experience replay.
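To make the idea concrete, the following is a minimal sketch of a replay buffer that samples in proportion to a learn-ability score. It is not the algorithm from the paper, whose exact formulation is not given here. It assumes that learn-ability can be approximated by an exponential moving average of the drop in a sample's loss between successive training visits, clipped at zero. The class name `LearnabilityReplayBuffer` and the parameters `smoothing`, `eps`, and `init_score` are hypothetical and chosen only for illustration.

```python
import numpy as np


class LearnabilityReplayBuffer:
    """Toy replay buffer that samples in proportion to a learn-ability score.

    Assumption (not from the source): learn-ability is approximated by an
    exponential moving average (EMA) of the per-sample loss decrease.
    """

    def __init__(self, capacity, smoothing=0.5, eps=1e-3, init_score=1.0, seed=0):
        self.capacity = capacity
        self.smoothing = smoothing    # EMA weight on the newest loss decrease
        self.eps = eps                # priority floor so every sample stays reachable
        self.init_score = init_score  # optimistic score for samples not yet trained on
        self.rng = np.random.default_rng(seed)
        self.data = []                # stored transitions
        self.last_loss = []           # latest loss per sample (nan = never trained on)
        self.score = []               # smoothed loss decrease per sample
        self.next_slot = 0            # ring-buffer write position

    def add(self, transition):
        """Store a transition, overwriting the oldest one once the buffer is full."""
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.last_loss.append(np.nan)
            self.score.append(self.init_score)
        else:
            i = self.next_slot
            self.data[i] = transition
            self.last_loss[i] = np.nan
            self.score[i] = self.init_score
        self.next_slot = (self.next_slot + 1) % self.capacity

    def priorities(self):
        """Return unnormalized priorities: positive part of the score plus a floor."""
        return np.maximum(np.asarray(self.score, dtype=float), 0.0) + self.eps

    def sample(self, batch_size):
        """Draw indices (with replacement) in proportion to the priorities."""
        p = self.priorities()
        idx = self.rng.choice(len(self.data), size=batch_size, p=p / p.sum())
        return idx, [self.data[i] for i in idx]

    def update(self, indices, losses):
        """Record fresh per-sample losses, e.g. TD losses, after a training step."""
        for i, loss in zip(indices, losses):
            prev = self.last_loss[i]
            if not np.isnan(prev):
                decrease = prev - loss  # positive when the loss went down
                self.score[i] = ((1.0 - self.smoothing) * self.score[i]
                                 + self.smoothing * decrease)
            self.last_loss[i] = float(loss)
```

Under this assumption, a sample whose loss keeps falling retains a positive score and is drawn more often, whereas a sample whose loss stays flat or fluctuates because of noise decays toward the `eps` floor. Prioritizing by the raw loss magnitude instead would keep favoring the noisy sample for as long as its loss remained high.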