Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of $\textit{value-aware importance weights}$ which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically.
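As a point of reference for the re-weighting idea described above, the following is a minimal sketch of ordinary importance sampling on a one-step problem with two actions. It is not the value-aware weighting proposed in this work, whose form is not given in the abstract. The policies `behavior` and `target`, the outcome `values`, and all function names are hypothetical and chosen only for illustration.

```python
import random

def importance_weights(target, behavior):
    """Ratio target(a) / behavior(a); assumes behavior(a) > 0 wherever target(a) > 0."""
    return {a: target[a] / behavior[a] for a in target}

def mean_under(dist, f):
    """Exact expectation of f(a) when a is drawn from dist."""
    return sum(p * f(a) for a, p in dist.items())

def variance_under(dist, f):
    """Exact variance of f(a) when a is drawn from dist."""
    m = mean_under(dist, f)
    return mean_under(dist, lambda a: (f(a) - m) ** 2)

# Hypothetical one-step problem (illustrative numbers only).
behavior = {"left": 0.5, "right": 0.5}   # distribution that generates the data
target = {"left": 0.1, "right": 0.9}     # distribution we want estimates under
values = {"left": 1.0, "right": 10.0}    # outcome observed after each action

rho = importance_weights(target, behavior)

# Unbiasedness: the weighted mean under the behavior distribution
# equals the plain mean under the target distribution.
true_value = mean_under(target, lambda a: values[a])
weighted_mean = mean_under(behavior, lambda a: rho[a] * values[a])

# The price is variance: weighted samples spread out more than
# on-policy samples drawn directly from the target distribution.
var_on_policy = variance_under(target, lambda a: values[a])
var_weighted = variance_under(behavior, lambda a: rho[a] * values[a])

# Monte Carlo estimate from behavior samples (seeded for repeatability).
rng = random.Random(0)
actions = list(behavior)
samples = rng.choices(actions, weights=[behavior[a] for a in actions], k=10_000)
estimate = sum(rho[a] * values[a] for a in samples) / len(samples)
```

In this toy setting the weighted estimator matches the target expectation exactly in expectation, while its per-sample variance is considerably larger than that of on-policy sampling. This is the kind of gap that lower-variance, still-unbiased weights aim to narrow.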