Most reinforcement learning algorithms use an experience replay buffer to train repeatedly on samples the agent has observed in the past. Not all samples are equally significant, and simply assigning equal importance to each sample is a naive strategy. In this paper, we propose a method that prioritizes samples by how much can be learned from them. We define the learn-ability of a sample as the steady decrease over time of the training loss associated with that sample. We develop an algorithm that prioritizes samples with high learn-ability while assigning lower priority to hard-to-learn samples, which are typically caused by noise or stochasticity. We empirically show that our method is more robust than random sampling and also outperforms prioritizing by the training loss alone, i.e., the temporal difference loss, as is done in prioritized experience replay.
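To make the idea of learn-ability-based prioritization concrete, the following is a minimal sketch, not the algorithm developed in the paper. It assumes a simplified proxy for learn-ability: the decrease in a sample's loss between its two most recent updates, clipped at zero. The class name `LearnabilityReplayBuffer`, the floor `eps`, and the optimistic initial priority of 1.0 are all hypothetical choices made for illustration.

```python
import random


class LearnabilityReplayBuffer:
    """Toy replay buffer that samples in proportion to recent loss decrease.

    Assumption: learn-ability is approximated by (previous loss - current loss),
    clipped at zero. This is an illustrative proxy, not the paper's definition.
    """

    def __init__(self, capacity, eps=1e-3, seed=0):
        self.capacity = capacity
        self.eps = eps            # floor so every sample remains reachable
        self.items = []           # stored transitions
        self.last_loss = []       # most recent loss per transition (None if unseen)
        self.priority = []        # sampling weight per transition
        self.rng = random.Random(seed)

    def add(self, transition):
        # Evict the oldest transition once the buffer is full.
        if len(self.items) >= self.capacity:
            self.items.pop(0)
            self.last_loss.pop(0)
            self.priority.pop(0)
        self.items.append(transition)
        self.last_loss.append(None)
        self.priority.append(1.0)  # optimistic start so new samples get visited

    def sample(self, k):
        # Draw k indices with replacement, weighted by priority.
        idx = self.rng.choices(range(len(self.items)), weights=self.priority, k=k)
        return idx, [self.items[i] for i in idx]

    def update(self, index, loss):
        # Priority grows with how much the loss dropped since the last visit;
        # a loss that stays flat or rises (e.g. due to noise) falls to the floor.
        prev = self.last_loss[index]
        if prev is not None:
            self.priority[index] = max(prev - loss, 0.0) + self.eps
        self.last_loss[index] = loss


# Usage: one sample whose loss decreases, one whose loss does not.
buf = LearnabilityReplayBuffer(capacity=4)
buf.add("learnable")
buf.add("noisy")
buf.update(0, 1.0)
buf.update(0, 0.5)   # loss dropped by 0.5, so the priority is high
buf.update(1, 1.0)
buf.update(1, 1.2)   # loss rose, so the priority falls to the floor
indices, batch = buf.sample(3)
```

In this sketch a sample with a large but non-decreasing loss receives the minimum priority, whereas prioritizing by loss magnitude alone would keep selecting it. A single-step difference is itself sensitive to noise, so capturing a steady decrease would likely require smoothing over a longer loss history.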