Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the returns of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable, and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion, the Reuse Bias of IS: the bias in off-policy evaluation caused by reusing the replay buffer for both evaluation and optimization. We theoretically show that evaluating and optimizing the current policy off-policy with data from the replay buffer results in an overestimation of the objective, which may cause erroneous gradient updates and degrade performance. We further provide a high-probability upper bound on the Reuse Bias and, by introducing the concept of stability for off-policy algorithms, show that controlling one term of this upper bound controls the Reuse Bias. Based on these analyses, we finally present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias. Experimental results show that our BIRIS-based methods can significantly improve sample efficiency on a series of continuous control tasks in MuJoCo.
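The overestimation described above can be illustrated with a toy example. The sketch below is not the BIRIS method and is not taken from the paper; it assumes a hypothetical multi-armed bandit with a uniform behavior policy, unit-variance Gaussian rewards, and one deterministic candidate policy per arm. All function names and parameters are illustrative. It selects the candidate with the highest IS estimate on a buffer and then compares the error of that estimate on the same buffer with its error on an independent buffer.

```python
import numpy as np


def is_estimates(actions, rewards, n_arms):
    """IS value estimate for each deterministic policy "always pull arm k",
    using data logged by a uniform behavior policy (toy assumption)."""
    n = len(actions)
    behavior_prob = 1.0 / n_arms
    # Importance weight for target policy k is 1{a == k} / behavior_prob.
    one_hot = np.eye(n_arms)[actions]  # shape (n, n_arms)
    return (one_hot / behavior_prob * rewards[:, None]).sum(axis=0) / n


def sample_buffer(rng, n, arm_means):
    """Draw n (action, reward) pairs from the uniform behavior policy."""
    actions = rng.integers(0, len(arm_means), size=n)
    rewards = rng.normal(loc=arm_means[actions], scale=1.0)
    return actions, rewards


def estimate_biases(n_trials=2000, n=50, arm_means=None, seed=0):
    """Monte Carlo estimate of the evaluation error of the selected policy,
    once on the buffer used for selection and once on a fresh buffer."""
    arm_means = np.zeros(5) if arm_means is None else np.asarray(arm_means, float)
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    reuse_err, fresh_err = [], []
    for _ in range(n_trials):
        a, r = sample_buffer(rng, n, arm_means)
        est = is_estimates(a, r, k)
        best = int(np.argmax(est))  # "optimization" step on the buffer
        # Evaluate the selected policy on the SAME buffer (reuse).
        reuse_err.append(est[best] - arm_means[best])
        # Evaluate the same selected policy on an independent buffer.
        a2, r2 = sample_buffer(rng, n, arm_means)
        fresh_err.append(is_estimates(a2, r2, k)[best] - arm_means[best])
    return float(np.mean(reuse_err)), float(np.mean(fresh_err))


reuse_bias, fresh_bias = estimate_biases()
print(f"error on reused buffer: {reuse_bias:+.3f}")
print(f"error on fresh buffer:  {fresh_bias:+.3f}")
```

In this setup every arm has the same true mean, so any gap between the two printed numbers comes purely from selecting and evaluating on the same data: the reused-buffer error should come out clearly positive, while the fresh-buffer error should stay close to zero.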